[Day 137] AWS Summit Seoul Day 2

 Hello :)
Today is Day 137!


A quick summary of today:
  • attended 'data analysis' and 'business tech' sessions during Day 2 of AWS Summit in Seoul


I tried to take more notes on my laptop during the sessions today.


First session - 'Data strategy for successful GenAI on AWS'

A GenAI app is like an iceberg. 

Under the water it consists of, among other things, data storage, databases, data lakes, data manipulation, and data governance. 

The session introduced us to RAG, and how we can either use an out-of-the-box language model, pre-train a model, or train a model from scratch. 

As for vector search, AWS offers plenty of options: Amazon OpenSearch Service, OpenSearch Serverless, Aurora PostgreSQL, RDS for PostgreSQL, DocumentDB, DynamoDB (via zero-ETL), MemoryDB for Redis, and Neptune.

As for databases, there are plenty of options for both structured and unstructured data. 

Amazon Bedrock - allows us to use out-of-the-box foundation models for GenAI applications. It gives us a simple API for calling a model, and supports model customisation, RAG, agents, and privacy and security protections. 
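To make the "simple API" point concrete, here is my own minimal sketch of calling a model through Bedrock with boto3 (not from the session). The model ID and region are assumptions and depend on what is enabled in the account, and the request body format differs per model.

```python
# Minimal sketch: calling a foundation model via Amazon Bedrock's runtime API.
# Assumes boto3 credentials are configured and the Titan Text model is enabled.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-2")  # assumed region

body = json.dumps({
    "inputText": "Explain what a data lake is in one sentence.",
    "textGenerationConfig": {"maxTokenCount": 256, "temperature": 0.5},
})

response = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",  # assumed model ID
    body=body,
)
result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
```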

If we use Amazon Bedrock, the vector databases we can use are Amazon OpenSearch Serverless, Redis Enterprise Cloud, Pinecone, Amazon Aurora, MongoDB.


Second session - 'How to best use vector databases on AWS'

Start with a PDF - chunk it - embed with Amazon Titan - store in Amazon Aurora PostgreSQL (or another db)
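As a rough sketch of the chunk-and-embed part of that pipeline (my own illustration, assuming the Titan Embeddings model is enabled in Bedrock; the chunk size and model ID are placeholders, and PDF text extraction is left out):

```python
# Rough sketch: naive chunking + embedding each chunk with Amazon Titan via Bedrock.
# Assumes the PDF's text has already been extracted to a plain text file.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with a small overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def embed(text: str) -> list[float]:
    """Embed one chunk with the Titan embeddings model (assumed model ID)."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


document_text = open("document.txt").read()  # extracted PDF text
vectors = [embed(chunk) for chunk in chunk_text(document_text)]
```

Storing the vectors is where Aurora PostgreSQL with pgvector (covered below) or one of the other vector stores comes in.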

Things we need to look at when embedding:

  • embedding time
  • size
  • query time

Questions to ask when choosing a db system

  • for my workflow, which vector storage is the best?
  • how much data am I going to store?
  • among storage, efficiency, connectivity, and cost, which is most important to me?
  • what are the pros and cons of the different indexing, query time and schema designs?
What is pgvector?
  • offers open source similarity search, vector storage, indexing and metadata
From pgvector's GitHub, it has 2 types of indexes: 

HNSW: An HNSW index creates a multilayer graph. It has better query performance than IVFFlat (in terms of speed-recall tradeoff), but has slower build times and uses more memory. Also, an index can be created without any data in the table since there isn’t a training step like IVFFlat.

IVFFlat: An IVFFlat index divides vectors into lists, and then searches a subset of those lists that are closest to the query vector. It has faster build times and uses less memory than HNSW, but has lower query performance (in terms of speed-recall tradeoff).
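A minimal sketch of what this looks like in practice, assuming a PostgreSQL database (e.g. Aurora/RDS) with the pgvector extension installed; the connection string, table, and column names are made up for the example:

```python
# Sketch: store 1536-dim embeddings (Titan v1 size) in Postgres with pgvector,
# build an HNSW index, and fetch the nearest chunks by cosine distance.
import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")  # placeholder connection string
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        content   text,
        embedding vector(1536)
    );
""")
# HNSW: slower build, more memory, better speed-recall tradeoff than IVFFlat
cur.execute(
    "CREATE INDEX IF NOT EXISTS chunks_hnsw_idx "
    "ON chunks USING hnsw (embedding vector_cosine_ops);"
)
conn.commit()

# pgvector accepts vectors as '[x, y, ...]' literals
dummy_vector = str([0.0] * 1536)  # in practice: an embedding from the sketch above
cur.execute(
    "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
    ("some chunk of text", dummy_vector),
)

# <=> is cosine distance; the HNSW index accelerates this ORDER BY
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (dummy_vector,),
)
print(cur.fetchall())
```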


Third session - 'Successful data stories from the business'

It presented many cases (companies like KFC, AT&T, Pfizer, and SCANIA) where, after integrating Snowflake into their operations, the companies reduced costs and improved customer satisfaction, profits, data volume, and customer service.


Below is an overview of the kinds of services offered


Basically, it was a sales pitch for Snowflake haha


Fourth session - 'Improving GenAI search performance with Amazon OpenSearch'

What is Amazon OpenSearch? The most optimized search and log solution
  • offers similarity search
  • seamless data analysis and visualisation
  • connects data sources
  • low cost solution
  • ease of use
The session introduced vector search and TF-IDF scoring, and the process of taking a PDF, chunking it, embedding it, storing it, and then finding and retrieving the passages most similar to a query.

Searching uses ANN algorithms - Hierarchical Navigable Small World (HNSW) and IVF.

It showed an example of semantic search (a rough sketch of the retrieval step follows the list):

  1. external model hosting and service connection
  2. creating an OpenSearch document collection (index) using a k-NN/neural search pipeline
  3. search request through the API gateway
  4. call the backend through the API gateway
  5. search the backend for related documents and then return them to the client
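Not the presenters' code, but a hedged sketch of what the retrieval step (step 5) can look like against an OpenSearch k-NN index; the host, index name, and field name are placeholders, and the index is assumed to already have a knn_vector mapping:

```python
# Sketch: query an OpenSearch k-NN index with the embedded user question
# and return the closest passages.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.ap-northeast-2.es.amazonaws.com", "port": 443}],  # placeholder
    use_ssl=True,
)

query_vector = [0.0] * 1536  # in practice: the user's question embedded with Titan

response = client.search(
    index="documents",
    body={
        "size": 5,
        "query": {
            "knn": {
                "embedding": {  # the knn_vector field holding the chunk embeddings
                    "vector": query_vector,
                    "k": 5,
                }
            }
        },
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("content"))
```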

We saw a demo of using Amazon OpenSearch and Bedrock for text2sql - similar to my 'chat with your PDFs' project, but of course on a much higher level, using OpenSearch and Streamlit. Also, every query, response, and piece of extra metadata gets saved in the backend.
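The demo's code wasn't shared, so here is a loose, hypothetical sketch of the text2sql idea: hand the model the table schema plus the user's question and ask it to return only SQL. The prompt format, schema, and model ID are my own assumptions, not the presenters' setup.

```python
# Hypothetical sketch of text2sql with Bedrock: schema + question in, SQL out.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

schema = "CREATE TABLE orders (id int, customer text, amount numeric, created_at date);"
question = "What is the total order amount per customer?"

prompt = (
    f"Given this PostgreSQL schema:\n{schema}\n\n"
    f"Write one SQL query that answers: {question}\n"
    "Return only the SQL."
)

response = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",  # assumed model ID
    body=json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {"maxTokenCount": 300, "temperature": 0.0},
    }),
)
sql = json.loads(response["body"].read())["results"][0]["outputText"]
print(sql)  # the generated query would then be run and its result shown in the app
```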

Below is what happens at a high level


Fifth session (from the business tech sessions, because there is no data analysis session in this time slot) - 'Easy data lake creation using AWS'

What is a data lake? A place that stores all kinds of data (structured and unstructured)

Data lake vs Data warehouse

  • data saved in any format (raw) vs fixed format
  • schema is not needed vs schema is needed
  • low trustability vs high trustability
  • better storage and volume costs vs fast and efficient queries


Data lake key features (example - Amazon S3)

  • data catalogue and searching
  • user access
  • data gathering
  • manipulation and analysis
There are also data lakehouses that combine the benefits of both lakes and warehouses.

Things to consider for data lakes in finance

  • regulation compliance
  • multi-user support
  • user authentication and ease of use/maintenance
  • authorization
  • protection against attackers
  • private network connection and traffic
Data lake security levels
  • identity and access management
  • application security
  • data protection
  • infrastructure security
  • network security
  • identity and access management (platform)
  • multi-account management
AWS Lake Formation


Sixth session - 'Using Amazon SageMaker Canvas for GenAI'

Data prep challenges

  • too many different tools
  • various code dependencies
  • scalability
  • operations management

Data prep/manipulation to get a clean version of the data for a model can take 60-80% of the time in an ML project.

SageMaker Canvas has us covered with a new feature - chat for data prep

  • run code from plain-text prompts, edit/remove/add data, and get insights - the whole data prep process done with no code, ready for building a model

After our data is ready, there are ready-to-use models for text, image, and audio content. 

Fine-tuning? It can be done without writing any code as well. We select a model (or up to 3) and the data, and we get training and validation results, plus extra info if we want to fine-tune with Python.


Great day. I am glad I attended ^^


That is all for today!

See you tomorrow :)
