[Day 193] Chapters 5, 6, and 7 from Effective Data Science Infrastructure

 Hello :)
Today is Day 193!


A quick summary of today:
  • covered chapters 5, 6, and 7 from Effective Data Science Infrastructure


Chapter 5: Practicing scalability and performance

Effective infrastructure must accommodate a wide range of projects. Rather than adopting a one-size-fits-all approach, it should offer a versatile toolbox of robust methods to achieve adequate scalability and performance. To enhance organizational scalability and ensure projects are comprehensible to the largest audience, our primary strategy is simplicity. Given that people's capacity to understand complex systems is limited, overengineering and overoptimizing cause extra costs.


Vertical scalability


Vertical scalability refers to the idea of handling more compute and larger datasets simply by using larger instances.

To start, we begin with a skeleton flow and then keep adding pieces until we get to the final solution.

The example model uses Yelp review data, and the goal is to group reviews together to find what kinds of reviews are generally posted on the platform.


In Metaflow, we can set a flow to run in a fixed environment by adding something like:

@conda_base(python='3.8.3', libraries={'scikit-learn': '0.24.1'}) 

above the class definition

and when we run this file, we need to add `--environment=conda` as an extra argument to tell Metaflow that we want conda to manage the environment.
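To make this concrete, here is a minimal sketch of such a flow (the flow name and step bodies are made up for illustration, not the book's example):

```python
from metaflow import FlowSpec, step, conda_base

# Pin the Python version and library versions for every step in the flow.
@conda_base(python='3.8.3', libraries={'scikit-learn': '0.24.1'})
class CondaDemoFlow(FlowSpec):

    @step
    def start(self):
        # Import pinned libraries inside the step so they resolve in the
        # conda environment Metaflow builds for it.
        import sklearn
        print('scikit-learn version:', sklearn.__version__)
        self.next(self.end)

    @step
    def end(self):
        print('done')

if __name__ == '__main__':
    CondaDemoFlow()
```

It would then be run with something like `python conda_demo.py --environment=conda run`.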

We can also let Metaflow request more resources with this decorator:

@resources(cpu=16, memory=4000)

When the step runs on AWS Batch, it tells the scheduler to allocate 16 CPU cores and 4,000 MB (roughly 4 GB) of memory to the workflow step to ensure it has adequate computational resources.

In contrast to the memory limit, the CPU limit is a soft limit on AWS Batch. The task can use all available CPU cores on the instance if there are no other tasks executing on it simultaneously. The limit applies when multiple tasks need to share the instance. In most cases, this behavior is desirable, but it makes benchmarking tricky, because a task with, for example, four CPUs, may end up using more CPU cores opportunistically.
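Putting the pieces together, here is a minimal sketch of how the decorator attaches to an individual step (the flow and step names are illustrative):

```python
from metaflow import FlowSpec, step, resources

class ResourcesDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # Request 16 CPU cores and 4000 MB of memory for this step.
    # The request takes effect when the step runs remotely,
    # e.g. `python resources_demo.py run --with batch`.
    @resources(cpu=16, memory=4000)
    @step
    def train(self):
        print('heavy compute goes here')
        self.next(self.end)

    @step
    def end(self):
        print('done')

if __name__ == '__main__':
    ResourcesDemoFlow()
```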


Horizontal scalability

As a rule of thumb, you should consider horizontal scalability, that is, using multiple instances, if you answer “yes” to any of the following three questions:

  • Are there significant chunks of compute in your workflow that are embarrassingly parallel (little or no effort is needed to separate the problem into a number of parallel tasks), meaning they can perform an operation without sharing any data besides inputs?
  • Is the dataset size too large to be conveniently handled on the largest available instance type?
  • Are there compute-intensive algorithms, such as model training, that are too demanding to be executed on a single instance?

By default, Metaflow will run at most 16 tasks in parallel; the limit can be raised with the `--max-workers` option.
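As an illustration of the embarrassingly parallel case, here is a minimal sketch of a Metaflow foreach fan-out (the shard list is made up):

```python
from metaflow import FlowSpec, step

class ForeachDemoFlow(FlowSpec):

    @step
    def start(self):
        # Each element of this list becomes its own parallel task.
        self.shards = ['shard-0', 'shard-1', 'shard-2', 'shard-3']
        self.next(self.process, foreach='shards')

    @step
    def process(self):
        # self.input holds the shard assigned to this particular task.
        self.result = len(self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        # Combine the results of the parallel branches.
        self.total = sum(inp.result for inp in inputs)
        self.next(self.end)

    @step
    def end(self):
        print('total:', self.total)

if __name__ == '__main__':
    ForeachDemoFlow()
```

Running it with something like `python foreach_demo.py run --max-workers 32` raises that default cap.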


PERFORMANCE TIP

Premature optimization is the root of all evil. Don’t worry about performance until you have exhausted all other easier options. If you must optimize performance, know when to stop.


Chapter 6: Going to production

“Deploying to production” means running workflows that are fully automated and highly available, starting and executing without human intervention and rarely failing, although specific use cases may have additional requirements such as low latency, handling massive datasets, or specific system integrations.


AWS Step Functions (SFN) is a highly available and scalable workflow orchestrator offered by AWS. It is a fully managed, maintenance-free service that integrates seamlessly with other AWS services, and is cost-competitive compared to in-house solutions. However, defining workflows manually using SFN's native JSON-based syntax is challenging, and the GUI is somewhat clunky. Additionally, SFN has limitations on workflow size, which may impact workflows with extensive parallel operations.

To deploy a flow to production: `python some_flow.py step-functions create`

When `step-functions create` is run, the following happens:

  1. Metaflow packaged all Python code in the current working directory and uploaded it to the datastore (S3) so it can be executed remotely. More about this in the next section
  2. Metaflow parsed the workflow DAG and translated it to a syntax that SFN understands. In other words, it converted your local Metaflow workflow to a bona fide SFN workflow
  3. Metaflow made a number of calls to AWS APIs to deploy the translated workflow to the cloud

To trigger a run, we can use `python some_flow.py step-functions trigger`.

some_flow.py is a simple flow with start and end steps. Since we created and triggered a run, in the AWS console we can see the execution details, a graph view, the event list, and the input/output of the run.


Using our terminal, we can see more info about a specific flow's runs:
`python sfntest.py step-functions list-runs`

Also, we can set a flow to run on a schedule using the @schedule decorator.
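As a small sketch (the flow name is made up), a daily schedule looks roughly like this; note that the schedule only takes effect on the deployed Step Functions workflow, not on local runs:

```python
from metaflow import FlowSpec, step, schedule

# Ask the orchestrator to start this flow once a day after deployment.
@schedule(daily=True)
class NightlyDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        print('scheduled run finished')

if __name__ == '__main__':
    NightlyDemoFlow()
```

Some other takeaways from the chapter on production deployments: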

  • Use containers and the @conda decorator to manage third-party dependencies in production deployments.
  • User namespaces help isolate prototypes that users run on their local workstations, making sure that prototypes don’t interfere with each other.
  • Production deployments get a namespace of their own, isolated from prototypes. New users must obtain a production token to deploy new versions to production, which prevents accidental overwrites.
  • The @project decorator allows multiple parallel, isolated workflows to be deployed to production concurrently.
  • Use @project to create a unified namespace across multiple workflows (see the sketch right after this list).
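A minimal sketch of the @project decorator (the project name is made up; branch and production deployments are then selected with command-line options):

```python
from metaflow import FlowSpec, step, project

# Flows deployed under the same project name share a unified namespace,
# while separate branches stay isolated from each other.
@project(name='review_clustering')
class ProjectDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        print('done')

if __name__ == '__main__':
    ProjectDemoFlow()
```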


Chapter 7: Processing data

This chapter is about loading data from different sources and thinking about the size of our project and what the best option for storing its data is: local files, a database, or the cloud.

Using local files:

  • Fast Loading: Loading data from a local file is significantly quicker than executing a SQL query
  • Stable Dataset: For effective prototyping, it's crucial that the dataset remains consistent. Systematic experimentation and iterations become impossible if the underlying data changes unexpectedly
  • User-Friendly: Loading data from local files doesn’t require special clients. It avoids random failures or slowdowns due to concurrent experiments by colleagues and is compatible with almost all off-the-shelf libraries

Unfortunately, local files have many downsides: they are unsuitable for production deployments or scaled-out experiments in the cloud and require manual updates, bypassing data security and governance policies. Using a cloud-based object store like AWS S3 can overcome these issues, making data compatible with cloud-based compute and governance, and with some effort, the user experience can be nearly as seamless as accessing local files, often even faster.
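As a sketch of what that can look like with Metaflow's built-in S3 client (the bucket and key here are hypothetical):

```python
import pyarrow.parquet as pq
from metaflow import S3

# Download an object from a hypothetical bucket and read it while the
# temporary local copy still exists (it is cleaned up when the block exits).
with S3(s3root='s3://my-bucket/project-data/') as s3:
    obj = s3.get('reviews.parquet')
    table = pq.read_table(obj.path)

print(table.num_rows)
```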


The chapter compares common file formats and their loading times, and offers two recommendations:

RECOMMENDATION

Whenever feasible, use the Parquet format to store and transfer tabular data instead of CSV files.

RECOMMENDATION

If memory consumption is a concern, avoid storing individual rows as Python objects. Converting to pandas can be costly as well. The most efficient option is to use Arrow and NumPy, if possible.
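To make the two recommendations concrete, here is a minimal sketch (file and column names are made up) that writes a small table to Parquet and reads it back with Arrow, pulling a column out as a NumPy array without creating per-row Python objects:

```python
import pandas as pd
import pyarrow.parquet as pq

# Write a small demo table to Parquet instead of CSV.
df = pd.DataFrame({'stars': [5, 3, 4, 1], 'useful': [10, 2, 0, 7]})
df.to_parquet('reviews.parquet')

# Read it back with Arrow; the data stays in columnar buffers rather
# than being materialized as one Python object per row.
table = pq.read_table('reviews.parquet')

# Convert a single column to a NumPy array for efficient numerical work.
stars = table.column('stars').to_numpy()
print(stars.mean())
```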


Interfacing with data infrastructure

  • Data: At the core, you have the data asset itself, stored as CSV files, Parquet files, or tables in a database.
  • Durable storage: Use durable storage systems like AWS S3 or replicated databases instead of a USB thumb drive; a modern data lake with a metadata layer like Apache Hive or Iceberg can facilitate access.
  • Query engine: A query engine processes SQL statements to select, filter, and join data; newer systems like Trino or Apache Spark are loosely coupled with storage, while streaming data can use systems like Apache Druid or Pinot.
  • Data loading and transformations: ETL is essential for data engineering; newer systems like Snowflake or Spark support ELT, and tools like DBT simplify data transformations, while tools like Great Expectations ensure data quality.
  • Workflow orchestrator: ETL pipelines are expressed as DAGs and executed by orchestrators like AWS Step Functions or Apache Airflow; using a centralized orchestrator can manage both data science and data workflows.
  • Data management: Increasing data complexity requires data management components like catalogues (e.g., Amundsen), governance systems for security and policies, and monitoring systems for overall data health and ETL pipeline quality.

As projects and teams grow, the distinction between a data engineer and a data scientist becomes blurrier.

The data engineer is responsible for data acquisition, that is, gathering raw data, data quality, and any data transformations that are needed to make the data available as widely consumable, authoritative, carefully curated datasets. In the case of structured data, the datasets are often tables with a fixed, stable schema. Notably, these upstream datasets should focus on facts—data that corresponds as closely as possible to directly observable raw data—leaving interpretations of data to downstream projects. The organization shouldn’t underestimate the demands of this role or undervalue it. This role is directly exposed to the chaos of the real world through raw data, so they play a critical role in insulating the rest of the organization from it. Correspondingly, the validity of all downstream projects depends on the quality of upstream data.

The data scientist focuses on building, deploying, and operating data science applications. They are intimately familiar with the specific needs of each project. They are responsible for creating project-specific tables based on the upstream tables. Because they are responsible for creating these tables, they can iterate on them independently as often as needed.


I definitely need to do some kind of project using Metaflow so I can experience these things firsthand. I tried to go over as much as possible today because I am getting charged by AWS as we speak haha. But when I decide to use Metaflow for a project, hopefully I can either set it up in GCP where I have free credits, or ideally, just locally.


That is all for today!

See you tomorrow :)
