[Day 219] Fundamentals of Data Eng and LLM data preprocessing pipelines in Mage

 Hello :)
Today is Day 219!


A quick summary of today:
  • starting 'Fundamentals of Data Engineering'
  • covered module 5: orchestration of the LLM zoomcamp



When I woke up today I saw that DeepLearning.AI is launching a new course - a DE Professional Certificate - at the end of August. The instructor, Joe Reis, is one of the authors of the famous 'holy book' of DE - Fundamentals of Data Engineering.
Thankfully, I found the book officially published for free by Redpanda. Below is a summary of what I read today (Chapter 1).

What is Data Engineering?

There are many definitions of the term, but they all share a similar idea. The book combines them into this one:

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.

The Data Engineering Lifecycle


Rather than fixating on the tech, this book focuses on the ideas and thinking behind it and on the processes in the data lifecycle. It summarises the lifecycle stages as:

  • generation
  • storage
  • ingestion
  • transformation
  • serving
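To make the flow concrete, here is a toy sketch of my own (not from the book) that maps the stages onto plain Python functions; storage is folded into the serving step for brevity:

```python
# A toy sketch (mine, not the book's) of the lifecycle stages as functions.
import json
import sqlite3

def generate() -> list[dict]:
    # Generation: data is born in a source system (app DB, logs, IoT, ...).
    return [{"user": "a", "amount": 10.0}, {"user": "b", "amount": 7.5}]

def ingest(records: list[dict], path: str = "raw.jsonl") -> str:
    # Ingestion: move raw data from the source into your environment.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path

def transform(path: str) -> list[tuple]:
    # Transformation: clean and shape the raw data for downstream use.
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return [(r["user"], round(r["amount"], 2)) for r in rows]

def serve(rows: list[tuple]) -> list[tuple]:
    # Storage + serving: persist to a queryable store for analysts/ML.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE spend (user TEXT, amount REAL)")
    con.executemany("INSERT INTO spend VALUES (?, ?)", rows)
    return con.execute(
        "SELECT user, SUM(amount) FROM spend GROUP BY user"
    ).fetchall()

print(serve(transform(ingest(generate()))))
```
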
Evolution of the Data Engineer

The early days: 1980 to 2000, from data warehousing to the web

Data engineering's roots lie in the development of data warehousing in the 1980s, spearheaded by innovations such as relational databases and SQL. The rise of the internet in the mid-1990s necessitated robust data systems for emerging web-first companies, leading to the evolution of scalable analytics and roles focused on data warehousing and business intelligence.

The early 2000s: The birth of contemporary data engineering

Post-dot-com bust, surviving tech companies like Yahoo, Google, and Amazon pushed traditional data systems to their limits, necessitating scalable, cost-effective solutions. This period saw the advent of big data, driven by innovations like Google's MapReduce and the emergence of Hadoop, along with the rise of public cloud services such as AWS, which revolutionized data storage and computing.

The 2000s and 2010s: Big data engineering

The birth of open-source big data tools and the transition to real-time data streaming defined this era. Hadoop and its ecosystem enabled companies to handle massive data sets, although the hype around "big data" sometimes led to its misapplication. The focus shifted from maintaining complex infrastructures to delivering business insights, paving the way for today's data engineering practices.

The 2020s: Engineering for the data lifecycle

Data engineering has evolved towards modular, managed, and highly abstracted tools, forming the modern data stack. The role now emphasizes data lifecycle management, including security, governance, and compliance, while leveraging improved tools and techniques to handle diverse and growing data sources efficiently. This period marks a golden age of data management, focusing on agility and decentralization.

Data Engineering and Data Science

There is some debate, but, as I thought, the authors see data science as sitting downstream of data engineering.

Although data scientists might want to spend their time modelling, most of their work happens at the bottom of the data science hierarchy-of-needs pyramid - gathering, cleaning, and processing data - possibly due to a lack of support from data engineers.

Data Engineering Skills and Activities

Data engineers used to focus on managing complex, monolithic technologies and low-level details, but now they utilize modern, simplified tools to create agile, cost-effective data solutions. They do not typically build ML models, create reports, perform data analysis, or develop software, though they must understand these areas to support stakeholders.

Data Maturity and the Data Engineer

The level of data engineering complexity within a company depends a great deal on the company’s data maturity. This significantly impacts a data engineer’s day-to-day job responsibilities.

Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization, but data maturity does not simply depend on the age or revenue of a company. 

Stage 1: Starting with Data

  • Characteristics: Early data maturity, vague goals, nascent data architecture, low adoption, small data team
  • Data Engineer's Role: Generalist, aiming to move quickly and add value
  • Challenges: Avoid premature ML projects, prevent working in silos, and steer clear of unnecessary complexity

Stage 2: Scaling with Data

  • Characteristics: Transition from ad hoc data requests to formal practices, scaling architectures
  • Data Engineer's Role: Specialist, focusing on scalable, robust architectures and formal practices
  • Challenges: Avoid chasing bleeding-edge tech, focus on practical solutions, and lead pragmatic data initiatives

Stage 3: Leading with Data

  • Characteristics: Data-driven company with automated pipelines, self-service analytics, seamless data introduction
  • Data Engineer's Role: Deep specialist, ensuring data availability and leveraging data for competitive advantage
  • Challenges: Avoid complacency, prevent technology distractions, and focus on business value

Business Responsibilities

  • Know how to communicate with nontechnical and technical people
  • Understand how to scope and gather business and product requirements
  • Understand the cultural foundations of Agile, DevOps, and DataOps
  • Control costs
  • Learn continuously

Technical Responsibilities

A data engineer should know how to code and possess production-grade software engineering skills. Although managed services and SaaS have simplified some tasks, strong coding abilities and understanding of software engineering best practices are still crucial for handling complex technical needs and gaining a competitive edge.

Main languages: SQL, Python, Java, Scala, bash

Secondary languages: R, JavaScript, Go, Rust, C/C++, C#, and Julia

How do you keep your skills sharp in a rapidly changing field like data engineering?

Focus on the fundamentals to understand what’s not going to change; pay attention to ongoing developments to know where the field is going. New paradigms and practices are introduced all the time, and it’s incumbent on you to stay current. Strive to understand how new technologies will be helpful in the lifecycle.

The Continuum of Data Engineering Roles, from A to B

In data science there are type A and type B data scientists: A for Analysis (understanding and deriving value from data) and B for Building (building systems that work in production, while sharing many skills with type A).

In DE, there are type A and B engineers too. 

Type A data engineers 

    A stands for abstraction. In this case, the data engineer avoids undifferentiated heavy lifting, keeping data architecture as abstract and straightforward as possible and not reinventing the wheel. Type A data engineers manage the data engineering lifecycle mainly by using entirely off-the-shelf products, managed services, and tools. Type A data engineers work at companies across industries and at all levels of data maturity. 

Type B data engineers 

    B stands for build. Type B data engineers build data tools and systems that scale and leverage a company’s core competency and competitive advantage. In the data maturity range, a type B data engineer is more commonly found at companies in stage 2 and 3 (scaling and leading with data), or when an initial data use case is so unique and mission-critical that custom data tools are required to get started.

These can be different people, or one person can perform both roles.

Data Engineers Inside an Organization


An external-facing data engineer designs and manages systems for collecting, storing, and processing data from user-facing applications like social media, IoT devices, and e-commerce platforms. These systems face challenges such as high concurrency, query limits to manage infrastructure impact, and complex security issues, especially with multitenant data.

An internal-facing data engineer typically focuses on activities crucial to the needs of the business and internal stakeholders. Examples include creating and maintaining data pipelines and data warehouses for BI dashboards, reports, business processes, data science, and ML models.


How Data Engineers Interact with Other Roles



Next, Module 5: Orchestration of the LLM zoomcamp


In Mage, there is a new pipeline type for RAG apps:

Such a pipeline has a data preparation stage and an inference stage.

The data preparation stage includes:

Loading -> Chunking -> Tokenization -> Embedding -> Saving in a vector db
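Outside of Mage's UI, the same stages can be sketched in plain Python. Below is a minimal sketch assuming sentence-transformers for embeddings (it handles tokenization internally) and chromadb as the vector DB - both are my own stand-ins, not necessarily what Mage's RAG pipeline uses under the hood:

```python
# A minimal data-prep sketch, assuming sentence-transformers and chromadb
# (my stand-ins, not necessarily what Mage's RAG pipeline uses).
import chromadb
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Chunking: split a document into overlapping character windows.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Loading: a real pipeline block would pull documents from a source here.
docs = {"doc-1": "Data engineering is the development, implementation, ..."}

model = SentenceTransformer("all-MiniLM-L6-v2")  # tokenizes and embeds
collection = chromadb.Client().create_collection("rag_chunks")

for doc_id, text in docs.items():
    chunks = chunk(text)
    collection.add(  # Saving: write chunks + vectors into the vector DB
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=model.encode(chunks).tolist(),
    )

# Inference side: embed a question and retrieve the most similar chunks.
question_embedding = model.encode(["What is data engineering?"]).tolist()
print(collection.query(query_embeddings=question_embedding, n_results=2)["documents"])
```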

To be honest, just like the original MLOps zoomcamp videos that introduce Mage, these videos seemed rushed and do not explain much, even though there are a lot of moving pieces in the UI. In total there are 8 videos, each ~2 minutes long. It seems this RAG pipeline will be part of Mage Pro (a paid version), so I will try not to get too attached to it, as I doubt I will pay to use Mage. Nevertheless, I might end up using this feature for my LLM zoomcamp project, or I might just go with a normal pipeline with normal blocks.


That is all for today!

See you tomorrow :)
