Data Processing
What is it?
Data Processing is the act of converting raw data into meaningful, usable information. In the context of data integration, it sits between the source and the destination, handling the transformation logic. This includes cleaning (removing errors), normalization (standardizing formats), aggregation (summarizing), and enrichment (adding value from other sources).
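A minimal Pandas sketch of those four steps on a hypothetical orders table (all column names and values are illustrative):

```python
import pandas as pd

# Hypothetical raw input: duplicates, mixed casing, missing values, strings.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "country":  ["us", "US", "US", None],
    "amount":   ["10.50", "20.00", "20.00", "5.25"],
})
regions = pd.DataFrame({"country": ["US"], "region": ["North America"]})

# Cleaning: drop exact duplicates and rows missing required fields.
clean = orders.drop_duplicates().dropna(subset=["country"]).copy()

# Normalization: standardize formats (upper-case codes, numeric amounts).
clean["country"] = clean["country"].str.upper()
clean["amount"] = clean["amount"].astype(float)

# Enrichment: add value from another source via a join.
enriched = clean.merge(regions, on="country", how="left")

# Aggregation: summarize into an analysis-ready table.
summary = enriched.groupby("region", as_index=False)["amount"].sum()
print(summary)  # North America  30.5
```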
Processing happens in one of two main paradigms, contrasted in the sketch after this list:
- ETL (Extract, Transform, Load): Data is processed before landing in the destination (typical for legacy Data Warehouses).
- ELT (Extract, Load, Transform): Raw data is loaded into the destination first, then processed using the destination’s compute power (typical for modern Cloud Data Warehouses like Snowflake or BigQuery).
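A toy contrast of the two paradigms, using Python’s built-in sqlite3 as a stand-in for the destination (table and column names are illustrative):

```python
import sqlite3

raw_rows = [("2024-01-03", "10.50"), ("2024-01-17", "20.00")]
con = sqlite3.connect(":memory:")  # stand-in for the destination warehouse

# ETL: transform in the pipeline first, then load the finished result.
transformed = [(d, float(a)) for d, a in raw_rows]  # cast before loading
con.execute("CREATE TABLE orders_etl (order_date TEXT, amount REAL)")
con.executemany("INSERT INTO orders_etl VALUES (?, ?)", transformed)

# ELT: load raw data as-is, then transform with the destination's engine.
con.execute("CREATE TABLE orders_raw (order_date TEXT, amount TEXT)")
con.executemany("INSERT INTO orders_raw VALUES (?, ?)", raw_rows)
con.execute("""
    CREATE TABLE orders_elt AS
    SELECT order_date, CAST(amount AS REAL) AS amount FROM orders_raw
""")

print(con.execute("SELECT SUM(amount) FROM orders_elt").fetchone())  # (30.5,)
```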
Why is it Important?
- Data Quality: Raw data is often messy, inconsistent, or incomplete. Processing ensures trust in the final output.
- Business Logic: It applies specific business rules (e.g., “calculate Monthly Recurring Revenue”, sketched after this list) that turn generic logs into business metrics.
- Performance: Pre-aggregating or indexing data during processing makes downstream querying significantly faster and cheaper.
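A minimal sketch of such a business rule, assuming a hypothetical subscriptions table where annual plans contribute one twelfth of their price per month:

```python
import pandas as pd

# Hypothetical subscription records; the schema is illustrative.
subs = pd.DataFrame({
    "customer":   ["a", "b", "c"],
    "plan_price": [120.0, 50.0, 50.0],  # price per billing cycle
    "billing":    ["yearly", "monthly", "monthly"],
})

# Business rule: monthly plans count in full; yearly plans count at 1/12.
subs["mrr"] = subs["plan_price"].where(
    subs["billing"] == "monthly", subs["plan_price"] / 12
)

print(subs["mrr"].sum())  # 120/12 + 50 + 50 = 110.0
```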
Well-known Solutions
- Engines: Apache Spark (sketched briefly after this list), Apache Flink, Pandas/Dask (Python).
- Frameworks: dbt (data build tool) for SQL-based transformation, Informatica (Traditional ETL).
- Cloud Native: AWS Glue, Google Dataflow.
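As one concrete shape these tools share, here is a minimal PySpark sketch of a read-transform-write job; the paths and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_totals").getOrCreate()

# Hypothetical raw event export with timestamp, country, amount columns.
events = spark.read.csv("raw/events.csv", header=True, inferSchema=True)

# Pre-aggregate per day and country so downstream queries scan less data.
daily = (
    events
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "country")
    .agg(F.sum("amount").alias("total_amount"))
)

daily.write.mode("overwrite").parquet("processed/daily_totals")
```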