Data Engineering Fundamentals
What is data engineering and what do data engineers do:
Data engineering focuses on designing, developing, testing, and maintaining databases and large-scale data processing systems.
Data engineers extract data from various sources (e.g., APIs, databases, CSV files) and load it into a single database for further use. They optimize the data schema for faster queries. They monitor pipelines to make sure the data isn't corrupted and repair pipelines when needed. They also schedule and automate tasks to reduce manual effort.
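As a rough illustration of the extract-and-load step, here is a minimal sketch using pandas and SQLite; the file name, table name, and cleanup rules are hypothetical.

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a CSV export (hypothetical path).
df = pd.read_csv("sales_2024.csv")

# Transform: light cleanup before loading.
df = df.dropna(subset=["order_id"]).drop_duplicates(subset=["order_id"])

# Load: write everything into a single database for further use.
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("sales", conn, if_exists="replace", index=False)
```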
Data Engineering Tools:
Data engineers move data from various sources, process and clean it, and finally load it into an analytical database for further use by data scientists. Data engineering tools can be grouped into three categories:
Database:
Types of data: structured (relational databases -> MySQL), semi-structured (key-value pairs -> JSON), or unstructured (files -> photos/videos).
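To make the distinction concrete, here is a tiny made-up example of each form in Python:

```python
import json

# Structured: fixed columns, one row per record (as in a relational table).
row = (101, "Alice", "alice@example.com")            # (id, name, email)

# Semi-structured: key-value pairs; fields can vary from record to record.
doc = json.loads('{"id": 101, "name": "Alice", "tags": ["vip"]}')

# Unstructured: raw bytes with no schema, e.g. the contents of a photo or video.
blob = b"\xff\xd8\xff..."                            # placeholder for image bytes

print(row, doc["tags"], len(blob))
```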
NoSQL: unstructured or semi-structured data (Redis, MongoDB). Typical use cases are caching and distributed configuration.
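A minimal caching sketch, assuming the redis Python client is installed and a Redis server is running on localhost (both are assumptions):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Cache the result of an expensive lookup for 60 seconds.
r.set("user:101:profile", '{"name": "Alice"}', ex=60)

# Later requests read from the cache instead of the slow source.
print(r.get("user:101:profile"))
```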
A database schema describes the structure of, and relations between, the tables in a database. Designing the schema is the first step in building a pipeline. It requires a lot of planning, and an ineffective schema can prove expensive, consuming a lot of memory and resources.
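A minimal, hypothetical two-table schema (customers and their orders) expressed with Python's built-in sqlite3, just to show tables and the relation between them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL,
    created_at  TEXT
);
""")
```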
Distributed computing: big data processing tasks are split into several smaller sub-tasks, and the processing tools distribute these sub-tasks over several computers. Advantages include faster processing and a smaller memory footprint per machine. Distributed computing also has some overhead, since it requires communication between processes.
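A local sketch of the split/process/combine pattern using Python's multiprocessing; on a real cluster the sub-tasks would run on separate machines and the overhead would be network communication, but the idea is the same:

```python
from multiprocessing import Pool

# One sub-task: aggregate a chunk of the data independently.
def subtotal(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the big task into 4 sub-tasks (4 slices of the data).
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(subtotal, chunks)  # processed in parallel
    print(sum(partials))                       # combine the partial results
```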
Processing Frameworks (Spark, Hive, Kafka):
1. Hadoop: Hadoop is a collection of open-source projects maintained by the Apache Software Foundation. It is a framework for distributed processing of large data sets across clusters of computers. Hadoop plays a principal role in ETL (Extract, Transform, Load) processing.
Hadoop consists of many components; two of the most important are:
- HDFS: Hadoop Distributed File System. Files reside on multiple different computers, which is essential for distributed computing. In cloud deployments it is often replaced by object stores such as Amazon S3 and Google Cloud Storage.
- MapReduce: a big data processing model. A program splits tasks into sub-tasks, and the data and workload are distributed across several processing units (clusters of computers). Writing MapReduce jobs by hand is hard.
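A single-machine sketch of the MapReduce model in plain Python (word count); in a real Hadoop job the map and reduce phases run distributed across the cluster:

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs from each input record.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield word, 1

# Shuffle phase: group the intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = ["spark hive hadoop", "spark hadoop", "hive"]
print(reduce_phase(shuffle(map_phase(records))))
# {'spark': 2, 'hive': 2, 'hadoop': 2}
```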
2. Hive (data extraction): helps with the extraction part of the ETL pipeline. Hive is built on top of the Hadoop ecosystem. Data can be queried in a structured way using HiveQL, a SQL-like query language. Hive provides an interface similar to SQL, but each query is executed as a job divided across a cluster of computers. Data extraction can be performed on any database that integrates with Hadoop. Initially Hive ran on MapReduce, but it now works with other data processing engines as well.
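For illustration, a HiveQL query could be issued from Python roughly like this; the PyHive client, the HiveServer2 endpoint on localhost:10000, and the logs table are all assumptions:

```python
from pyhive import hive  # third-party client for HiveServer2

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# The SQL-like query is translated into jobs that run on the cluster.
cursor.execute("SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)
```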
3. Spark (data transformation): a parallel computation framework. Spark distributes tasks between clusters of computers. Disk writes between jobs make MapReduce expensive; Spark was created to avoid those expensive disk writes and to process data faster by keeping as much of the processing as possible in memory. Spark relies on Resilient Distributed Datasets (RDDs).
An RDD is a data structure that maintains data distributed across various nodes. RDDs do not have named columns; conceptually they are like a list of tuples. Lost data can be recovered by tracking information about data lineage. RDDs support two kinds of operations: transformations (filter(), map(), union()) and actions (count(), collect(), reduce()). A transformation produces a new, transformed RDD, while an action produces an output.
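A small PySpark sketch of RDD transformations and actions (assumes a local Spark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: they only describe a new RDD.
doubled_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

# Actions trigger the computation and return a result.
print(doubled_evens.collect())          # [4, 8]
print(rdd.count())                      # 5
print(rdd.reduce(lambda a, b: a + b))   # 15

spark.stop()
```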
PySpark: the programming-language interface to Spark, i.e., a Python API for Spark. It feels similar to pandas because of its DataFrame abstraction (two-dimensional, heterogeneous tabular data).
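A short PySpark DataFrame sketch showing the named-column, pandas-like abstraction (the data is made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Unlike a plain RDD of tuples, a DataFrame has named columns.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("alice", 12)],
    ["name", "amount"],
)

(df.filter(F.col("amount") > 20)
   .groupBy("name")
   .agg(F.sum("amount").alias("total"))
   .show())

spark.stop()
```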
Automation (Apache Airflow, Oozie, Luigi):
4. Airflow (automation): suppose we get data from various sources (CSV files, APIs), and we need to clean it up and load it into a database for data scientists. We might have to merge the data from all the sources after cleanup and then add it to the database. This creates dependencies between tasks, since merging can only happen after the data gathered from the different sources has been cleaned.
We can build a directed acyclic graph (DAG) to map out the dependencies between jobs.
Apache Airflow is a scheduling tool for describing, executing, and monitoring such pipelines.
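A minimal Airflow DAG sketch for the scenario above; the task bodies are placeholders, and the import path and schedule argument assume Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_csv():
    print("pull the CSV export")        # placeholder

def extract_api():
    print("pull data from the API")     # placeholder

def merge_and_load():
    print("merge cleaned data and load it into the database")  # placeholder

with DAG(
    dag_id="etl_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_csv = PythonOperator(task_id="extract_csv", python_callable=extract_csv)
    t_api = PythonOperator(task_id="extract_api", python_callable=extract_api)
    t_merge = PythonOperator(task_id="merge_and_load", python_callable=merge_and_load)

    # The DAG edges: merging runs only after both extractions finish.
    [t_csv, t_api] >> t_merge
```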
ETL Pipeline
Data extraction means pulling data from persistent storage (files, databases, APIs) into memory so that data transformations can be performed.
The data could come in one of several formats:
- Unstructured (plain text files)
- Flat files (each row is a record, each column a feature -> CSV files)
- JSON, JavaScript Object Notation (semi-structured data). Most data requested through an API is transferred in JSON format (see the sketch after this list).
- Databases: transaction/application databases, and analytical databases (optimized for querying aggregated data).
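A small extraction sketch covering a flat file and a JSON API response; the file path and the API URL are hypothetical:

```python
import pandas as pd
import requests

# Flat file: each row is a record, each column a feature.
orders = pd.read_csv("orders.csv")                         # hypothetical path

# API: most endpoints return semi-structured JSON.
resp = requests.get("https://api.example.com/customers")   # hypothetical URL
customers = pd.json_normalize(resp.json())

print(orders.head())
print(customers.head())
```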