You probably don’t think about data engineering when surfing the web.
Maybe it crosses your mind when you click on an online item and see a few suggested products related to the item you’re viewing.
You know those suggestions come from machine learning algorithms, which are a product of data engineering. But that’s like explaining how the magician pulled the rabbit from the hat by saying, “Oh, it’s an illusion.”
How does machine learning work? Oh, it’s data engineering.
But how does data engineering work? Let’s pull back the curtain and reveal the magic that data engineers do on a daily basis.
Basic Data Engineering
At its most basic, data engineering takes data from a source, preserves it, and makes it available to analysts. (Some would say this is so basic, it isn’t really “data engineering” at all.)
The source of the raw data might be an application your business built. You might also pull in data from Google Analytics, a CRM application, and maybe a few other sources.
Preserving it means either manually adding it to an Excel spreadsheet or storing it in a database. Making it available for analysis involves a lot of manual labor to clean the data and get it into Excel.
This level of data engineering isn’t sustainable for long. Not only does the monthly volume of data keep growing, but your company will also want more kinds of data to analyze. You can’t have too much data, right?
As the amount of data grows, the manual process of cleaning and storing data becomes more difficult and time-consuming. Pretty soon, you need to find a way to automate the process.
Automation usually starts with an ETL pipeline. ETL stands for “extract, transform, load,” which is what ETL pipelines do.
Extracting data often involves establishing API connections. An API is an Application Programming Interface: a defined way for one piece of software to request data from another. Your pipeline calls each source’s API to pull data from all your various sources.
Once extracted, you have to transform the data, which means removing errors, changing formats, mapping the same types of records to each other, and validating the data. Some call this process “cleaning the data” and define data transformation as a process that takes clean data and converts it into a new format or structure.
After the ETL tool cleans and transforms your data, it loads it into a database or, more likely, a data warehouse. Your software engineer writes a script to run this process monthly or weekly, depending on your business’s needs.
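The three ETL stages described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `RAW_ORDERS` records, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the destination database or warehouse.

```python
import sqlite3

# Hypothetical raw records, standing in for rows pulled from a source API.
RAW_ORDERS = [
    {"order_id": "1001", "email": " Ana@Example.com ", "total": "19.99"},
    {"order_id": "1002", "email": "bob@example.com", "total": "not-a-number"},
    {"order_id": "1003", "email": "CARL@EXAMPLE.COM", "total": "5.00"},
]


def extract():
    """Extract: return raw records (a real pipeline would call an API here)."""
    return list(RAW_ORDERS)


def transform(records):
    """Transform: normalize formats and drop records that fail validation."""
    clean = []
    for rec in records:
        try:
            total = float(rec["total"])
        except ValueError:
            continue  # validation failed, so skip this record
        clean.append((int(rec["order_id"]), rec["email"].strip().lower(), total))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, email TEXT, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run the three stages in order; a scheduler would call this weekly or monthly."""
    load(transform(extract()), conn)


conn = sqlite3.connect(":memory:")  # stand-in for the real destination
run_pipeline(conn)
loaded = conn.execute(
    "SELECT order_id, email, total FROM orders ORDER BY order_id"
).fetchall()
```

Here the record with an unparseable total is dropped during the transform step, so only two cleaned rows reach the `orders` table.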
Your analysts explore the resulting data with business intelligence (BI) tools, dashboards, and so on. And even though this is a relatively simple ETL pipeline, the greater access to data allows your team to analyze, iterate, improve, and share fresh data on a regular cadence.
ETL pipelines open up a flood of data. Suddenly, you can track the entire sales funnel, from the first visit to the final purchase. You can analyze customer behavior and track high-level KPIs. Your business can make more informed decisions and see how those decisions change the way the company works.
But as the amount of data continues to grow, the pipeline begins to clog. Reports take several minutes to return, and errors become more frequent.
It happens because your pipeline uses a standard transactional database. Transactional databases (for example, MySQL) are optimized for quickly reading and writing individual rows. They work well for running the operations of an app, but analytics jobs and complex queries bog them down.
A standard database or repository consolidates data from several sources into a central location. But does putting your data in one place make it ready for analysis? No. You still have to organize it, and that is the job of a data warehouse.
When you pull data from multiple sources, you end up with various types of data: sales reports, site traffic data, demographics information, and so on. A data warehouse organizes all that data into tables, and then organizes the tables by the relationships between the different data types. This structure is called a schema.
How do schemas help? They structure the data in a way that facilitates and accelerates data analysis. Getting a schema right is crucial, so expect it to take several rounds with your data engineering team before you all agree on the best warehouse design.
Regular databases are designed for running simple transaction queries, whereas data warehouses can run complex analytics queries. With a data warehouse in place, your business’s data flows from multiple sources through your ETL pipeline and into the warehouse. The ETL transforms and validates the data, and then the warehouse organizes it into tables.
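To make the idea of a schema concrete, here is a small sketch of one common warehouse layout, a star schema: a central fact table linked to descriptive dimension tables. The article does not name a specific design, so treat the layout, the table names (`fact_sales`, `dim_customer`, `dim_product`), and the sample rows as assumptions. SQLite is used only so the example runs anywhere; a real warehouse would use a dedicated engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who" and "what" of each sale.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

-- The fact table records each sale and points at the dimensions.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);

INSERT INTO dim_customer VALUES (1, 'Brazil'), (2, 'Chile');
INSERT INTO dim_product VALUES (10, 'Books'), (20, 'Games');
INSERT INTO fact_sales VALUES
    (1, 1, 10, 20.0), (2, 1, 20, 40.0), (3, 2, 10, 10.0);
""")

# An analytics query: total revenue per country, joining facts to a dimension.
revenue_by_country = conn.execute("""
    SELECT c.country, SUM(s.amount)
    FROM fact_sales AS s
    JOIN dim_customer AS c ON c.customer_id = s.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
```

Because the relationships are declared up front, a question like "revenue by country" becomes a single join and aggregation rather than a manual merge of spreadsheets.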
Now your analytics teams can use business intelligence tools to interact with the data to gain more significant insights. Your data engineer’s focus changes to refining and maintaining the pipeline and data warehouse.
But you’re still not getting the complete picture from your data. For that, you need to hire a data scientist. And they might need a data lake.
Data Engineering with Data Lakes
Before we get into data lakes, let’s look at how data scientists and data engineers work together.
A data scientist’s job is to predict the future. They dig deep into your data and make hidden connections, uncover new insights, and build models to predict what your customers will do next.
For example, one of your product managers might ask the data scientist to forecast product sales in South America for the year’s fourth quarter. The data scientist would work with your data engineer to design and build a custom pipeline for just that request.
And that’s why you might need a data lake. Your data warehouse only stores structured data for tracking specific metrics. Data scientists often need to process raw, unstructured data. Data lakes keep all the raw data without preprocessing it or imposing a defined schema.
Even the pipeline is different with a data lake. Instead of extracting, transforming, and loading the data, the pipeline simply extracts and loads the data into the data lake, and the data scientist decides how to transform it for their use. So it becomes an ELT pipeline instead of an ETL pipeline, because the transformation happens at the end.
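The difference in ordering can be sketched as follows. This is an illustrative toy, assuming hypothetical JSON events and a single `raw_events` table standing in for the lake: records are loaded exactly as they arrive, and the transformation runs later, only when someone asks a specific question.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from a source system.
RAW_EVENTS = [
    '{"user": "u1", "action": "view", "price": 12.5}',
    '{"user": "u2", "action": "purchase", "price": 30}',
    '{"user": "u1", "action": "purchase", "price": 12.5}',
]

lake = sqlite3.connect(":memory:")  # stand-in for the data lake
lake.execute("CREATE TABLE raw_events (payload TEXT)")

# Extract + Load: store each record untouched, with no schema imposed.
lake.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in RAW_EVENTS])


def purchases_per_user(conn):
    """Transform at read time, for one specific analysis."""
    totals = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event["action"] == "purchase":
            totals[event["user"]] = totals.get(event["user"], 0) + event["price"]
    return totals


purchase_totals = purchases_per_user(lake)
```

A different analysis could read the same raw payloads and transform them in a completely different way, which is the flexibility the lake is meant to provide.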
Your data scientist can do a lot with a data lake, from exploring new analytics and making educated forecasts to building machine learning models. Your data engineers must maintain a steady stream of data into the lake.
Data lakes came about because of our ability to generate increasingly large amounts of diverse and unstructured information. Or, as it’s more commonly known, big data.
More than just a vast amount of data, big data is characterized by the “four Vs”: Volume, Veracity, Velocity, and Variety.
- Volume refers to how much data is collected. Generally, the more information you collect, the more profound the insights become.
- Veracity relates to how reliable data is. Big data is no good unless it’s valid and comes from a trusted source.
- Velocity is how fast data can be generated, gathered, and analyzed.
- Variety refers to the range of data types and sources you collect, from structured tables to unstructured text, images, and logs. Most businesses need several sources to prevent skewing the data.
At this level, you’re processing thousands of transactions simultaneously. And your data engineer becomes an entire team.
Big data, therefore, requires a different approach to the pipeline. Until now, we have been talking about batched data—in other words, the system grabs a batch of data via APIs at prescheduled times. The pipeline extracts, transforms, and loads the data into a warehouse or lake.
With big data, new records are often generated every second. More importantly, you may need to process and analyze that data immediately. The ability to quickly handle big data is mission-critical if you’re a major online retailer processing millions of transactions per hour, for example.
Big data systems use a data streaming method called publish and subscribe, or Pub/Sub, to efficiently handle this kind of demand. Pub/Sub allows services to communicate asynchronously, with latencies on the order of 100 milliseconds.
What does that mean? Most web communication is synchronous. That is, the system sends a request via an API and then waits for a response from the data source. Under heavy loads, synchronous communication slows to a crawl.
Pub/Sub enables asynchronous conversations between multiple systems that generate a lot of data simultaneously. Instead of coupling data sources with data consumers, it divides data into different topics. Data consumers subscribe to these topics. As the system generates new data records or events, they get published to the topics, and subscribers can access the data on their schedule.
The advantage is that systems don’t have to send synchronous messages and wait for each other. They can handle thousands of events per second.
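The topic-and-subscriber pattern can be sketched with a toy in-memory broker. This is only an illustration of the idea, not the API of any real Pub/Sub service: the `Broker` class, its method names, and the `orders` topic are all hypothetical, and a real system would add persistence, acknowledgements, and network transport.

```python
from collections import defaultdict, deque


class Broker:
    """Toy in-memory publish/subscribe broker (illustrative only)."""

    def __init__(self):
        # topic name -> {subscriber name -> queue of undelivered messages}
        self._topics = defaultdict(dict)

    def subscribe(self, topic, subscriber):
        """Register a subscriber; it receives messages published from now on."""
        self._topics[topic][subscriber] = deque()

    def publish(self, topic, message):
        """Queue a copy for every subscriber; the publisher never waits for them."""
        for queue in self._topics[topic].values():
            queue.append(message)

    def pull(self, topic, subscriber, max_messages=10):
        """Let a subscriber drain its own queue on its own schedule."""
        queue = self._topics[topic][subscriber]
        batch = []
        while queue and len(batch) < max_messages:
            batch.append(queue.popleft())
        return batch


broker = Broker()
broker.subscribe("orders", "recommendations")
broker.subscribe("orders", "analytics")
broker.publish("orders", {"order_id": 1})
broker.publish("orders", {"order_id": 2})

# Each consumer reads at its own pace without affecting the other.
first_pull = broker.pull("orders", "recommendations", max_messages=1)
analytics_pull = broker.pull("orders", "analytics")
```

The publisher knows only the topic name, not who is listening, which is what decouples data sources from data consumers.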
Distributed Computing and Storage
Another way to handle big data is distributed storage and distributed computing. To store petabytes of data, you need to distribute it over several servers combined into a cluster.
A popular technology for distributed storage is Hadoop, a framework whose distributed file system (HDFS) stores data across the machines in a cluster. Hadoop’s popularity comes from its high scalability, allowing you to add more and more computers to the cluster as your data keeps growing. Because it keeps redundant copies of your data on different machines, your data stays available even if some of the computers in the cluster fail.
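The value of redundancy can be shown with a simplified placement sketch. It assumes a replication factor of 3 (the HDFS default) and a plain round-robin assignment; real HDFS placement is rack-aware and more involved, and the node and block names here are hypothetical.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(["blk-0", "blk-1", "blk-2", "blk-3"], nodes)

# Two of the four machines fail, yet every block still has a live copy.
survivors = readable_blocks(placement, failed_nodes=["node-a", "node-b"])
```

With three copies of each block, losing any two machines in this toy cluster leaves every block readable; data is lost only when all three holders of a block fail at once.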
Your data engineering team also needs a data processing framework, like Apache Spark, to run computations in parallel across the cluster.
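The core idea behind distributed processing is to split the data into partitions, work on each partition independently, and merge the partial results. The sketch below imitates that pattern on one machine with a thread pool. It does not use Spark’s API, and the sales records and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts roughly equal chunks."""
    return [records[i::n_parts] for i in range(n_parts)]


def partial_totals(chunk):
    """Work done independently on each partition (the "map" side)."""
    totals = {}
    for region, amount in chunk:
        totals[region] = totals.get(region, 0) + amount
    return totals


def merge(results):
    """Combine per-partition results into one answer (the "reduce" side)."""
    merged = {}
    for totals in results:
        for region, amount in totals.items():
            merged[region] = merged.get(region, 0) + amount
    return merged


sales = [("south", 10), ("north", 5), ("south", 7), ("east", 3), ("north", 1)]

# Each worker handles one partition; in a real cluster these would be machines.
with ThreadPoolExecutor(max_workers=3) as pool:
    totals_by_region = merge(pool.map(partial_totals, partition(sales, 3)))
```

A framework like Spark applies the same split, process, and merge pattern, but schedules the work across many machines and recovers from worker failures.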
Advanced Data Engineering Pipeline
When you put it all together, your big data pipeline looks like this: Thousands of records per second stream into your Pub/Sub system, where an ETL or ELT framework (e.g., Spark) processes and loads them into data lakes or warehouses, or moves them into custom pipelines. Your data repositories are distributed and stored on server clusters (e.g., Hadoop).
With a big data pipeline in place, data scientists, analytics users, and machine learning algorithms can easily use your data to make predictions and generate new data.
You’re still fulfilling the basic purpose of data engineering: taking data from a source and saving it for later analysis. But as data needs grow, the system gets more complex.
So if you’re an online retailer like Amazon, you need to provide a seamless and fast shopping experience to each user, regardless of how many thousands of records you process per minute. With big data engineering and an advanced data pipeline, each user data point travels through the system at lightning speed, growing your knowledge about your customers and feeding the machine learning models that immediately deliver product recommendations based on their actions.
And that, in a nutshell, is the magic of data engineering.