当前位置：首页 > news >正文

Data Engineering - Management Preparation

news 2026/6/17 0:06:21

Data Science

We define DS as Engineering + Analytics. Analytic part has alr demonstrated through another blog. This time we focus on Engineering. Where Engineering = Management (Structured & Unstructured) + Preparation. We focus more on OLAP.

The core idea in Data Engineering is Data Redundancy. (If want do OLAP, must duplicate into the component.)

1 Data Management

Key words for review: Workload, (Big) Data Characteristics(Volume, Variety, Velocity).

The main idea for data management is Data Storage. Generally, the first step is to define OLTP/OLAP, then next choose from 4V. Modernly, we tend to use many small engines(stateless DBMS) to construct a big logical, decoupled platform(lakehouse).

Data Platform/Framework: Warehouses, Lakehouses, etc. The difference between Warehouses and Lakehouses is if they are decoupled (we say an integrated OLAP DBMS = Warehouse, multiple engine components = Lakehouse).

Data Warehouse Architecture (sigmod2016,snowflake): Storage Layer -> Virtual Warehouse Layer -> Cloud Service Layer. Any service must be based on VW.
Data Lakehouse Architecture (cidr2021,databricks): Object Store Layer -> Metadata Layer -> Engine Layer. Any service can access data on their own directly.

First we learn something called DBMS. Technically, traditional DBMS are different from modern DBMS(we call them engines, eg. Spark SQL).

The core idea for Data Management is -> (Intentional) Data Redundancy (for both workloads).

Database Management is the first topic in this area (which includes structured/traditional relational db etc. and unstructured/modern vector db etc.). Note Database Management is firstly used for structured data (could be used for unstuctured data as well).

We use data-based instead of data-driven because the latter means automatic.

DBMS Functionality

Data Model

Schema is abstract, Instance is concrete. Database schema is a set of relation schemas. Relation (instance) is a set of tuples. Tuple = Row = Record. Attribute = Column = Field. We can say relation (abstract) is a table (concrete).

Storage Management

Manage how DBMS controls database. (Database System = Database + DBMS, technically we should say MySQL DMBS is designed for row-stored databases, furthermore, MySQL is merely a server layer, actually index and transaction/low-level is based on InnoDB engine inside MySQL.)

Metadata Management

Data and metadata should be managed samely.

Data Definition and Access

DBMS provides DDL/DML.

View Management

Global/Local data.

Data Independence

Logical–physical separation is part of this idea. (external + conceptual/logical + internal)

Data Integrity

Database consistency (data is correct).

Access Control and Privacy

Access control for different groups.

Database Maintenence

Query optimization.

Query Process/Optimization

Declarative query -> Parser & translator -> Internal algebraic expression -> Optimizer -> Execution plan -> Evaluation engine -> Query output.

Logical optimization is for the whole plan, Physical optimization (nest loop JOIN, sort merge JOIN, hash JOIN) is for a single operation. We use a very old formula for this course (now we use bigger RAMs) Cost = I/O cost + CPU cost.

We oftenly use three types of indices: B+, Hash, Bitmap. B+ is common, while hash can't be used to search an interval, bitmap is good for OLAP. Note these are only used in DBMS, database itself doesn't participate.

Transaction Processing

For a DBMS, we follow ACID rule -> Atomicity, Consistency, Isolation, Durability.

Most strict concurrency control: serializable (equivalent to a serial history). For conflicts only focus on at least one W on the same data.

Algorithms: pessimistic or optimistic(prefix), with lock or timestamp(suffix). Common methods are 2PL(lock)/MVCC(timestamp).

2PL: 2-phase lock, finish all reading then start writing.

MVCC: If some transaction committed after Ti started and created a new version of an item in Ti's read set, then Ti's snapshot is stale and Ti must abort.(read + write + validate + commit/abort)

1.1 Structured Data Management

1.1.1 Relational Data Model

Note a relation instance is a whole "table" tuples, not a single row.

We have 3 constraints: implicit constraints(tuple, attribute unique)/explicit constraints(primary key, foreign key)/semantic constraints(consistency).

1.1.2 Relational Algebra/SQL

Relational Data Language = Relational calculus + Relational algebra. We focus on algebra.

Relational Algebra forms SQL. (selection/projection/union/set difference/Cartesian product; intersection/join/division; outer join/rename/aggregate/assignment)

Note difference between Union(add row) and Join(add column). Understand division(pick all) and product(search all). Assignments logically creates new table, but rename is only for renaming.

SQL allow duplicate rows(unlike relational model). SQL has DML and DDL. (SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT/OFFSET)

Assertions are best for global database constraints across multiple tables or rows; triggers only happen dynamically when an operation is commited.

VIEWs are stored SQL programs.

1.1.3 Workloads

OLTP (row): MySQL, PostgreSQL, SQL Server

Minimizes the tuple reconstruction cost. Primary concern is transaction latency and throughput.

Commonly, we assume DBMS = OLTP (just by convenience). NoSQL are mainly all OLTP.

OLAP (column): Snowflake, ClickHouse, Amazon Redshift

Reduces I/O by reading only the required attributes. Suit for end-of-month work. This is oftenly for Data Warehouses. OLAP DBMS are mainly all SQL.

Vector Execution (a vector is a short column)

Run-length Encoding (long runs of repeated values)

Late Materialization (reconstructing tuples only when required late in the query plan)

HTAP (mixed, modern): SAP HANA, TiDB, SingleStore

Running analytics directly on fresh transactional data. Supports real-time. Introduces NewSQL.

MVCC is particularly important in HTAP systems.

(Very expensive, so can only do small, near OLAP's.)

1.2 Unstructured Data Management

Three main categories in OLAP:

Data Lake

Data Warehouse

Data Lakehouse

4V(or 3V) = (Volume + Variety + Velocity) + Veracity. Big data is approximately semi/unstructured data.

1.2.1 Big Data (Volume)

This is where we introduce Data Lakehouse.

Scale-up is used for single machine, scale-out means scaling the number of machines.

Data Center Architecture

Three sharing architectures: shared-memory, shared-disk, shared-nothing(modern).

Parallel Data Management

Data Partitioning Methods: round-robin partitioning(inefficient, simplest)/hash partitioning(skew)/range partitioning(skew)
Execution Models: interoperator parallelism(each operator parallel)/intraoperator parallelism(a single operator is done by multiple nodes)
Parallel Algorithms: parallel sort/parallel nested-loop join/parallel hash join(suits for shared nothing, because could be done locally)
Impediments to Parallelism: Machine numbers scaling-up doesn't result in linear promotion in performance.

Distributed Data Storage

HDFS has a single NameNode (metadata), multiple DataNode (data chunks).

Big Data Processing Frameworks

Here's where we introduce Distributed Frameworks. For example, Spark SQL + Spark Core + HDFS looks like a OLAP 'DBMS', but since they are not inside a real DBMS, and far more intelligent than a DBMS could do, we call it a data lakehouse.

Hadoop/MapReduce (traditional)

Start to use HDFS as low-level.

Three step: Map(record value) + Shuffle(group by keys from all mapper, most costly) + Reduce(aggregate values for each key).

Number of maps = splits. Number of mapper outputs = number of reducers. Note the same key will go into the same reducer: f(key) = hash(key) mod numOfReducer. (shuffle)

Spark (modern)
Cloud-native Systems (new)

Suits log-structured storage, immutable micro-partitions, tiered storage, pushdown / near-data processing, consensus-based replication, global clocks / TrueTime / HLC.

Cloud-native must be cloud-hosted.

For OLTP: disaggregated compute-storage(decouple storage and computation); disaggregated compute-log-storage(flush based on logs, speeds up write); disaggregated compute-buffer-storage(shared buffer/cache, speeds up read).
For OLAP: disaggregated compute-storage; disaggregated compute-memory-storage(add a memory module for intermediates).

Geographic Distribution

The only new idea is allocation (close data should be put near). For distributed systems, we should calc not only CPU, I/O, but communication cost as well.

Local serializability is not enough:

Centralized 2PL (simplest)
Distributed 2PL
Distributed Deadlock
2PC

1.2.2 Big Data (Variety)

This is where we introduce NoSQL. (to solve 'one size fits all' problem)

Text Data

Traditional: inverted index -> tokenization -> stemming -> TF-IDF -> Lucene / Elasticsearch -> NLP extraction
Modern: embeddings -> semantic retrieval -> vector indexing / ANN

Image Data

Traditional: CBIR, SIFT / HOG.
Modern: CNN / ViT, Multimodal.

Video and Audio Data

The complexity comes from time structure.

NoSQL Systems

Key-value Stores(simplest): Redis, DynamoDB
Document Stores(JSON-like): MongoDB, Couchbase
Wide-column Stores(distributed write/sparse): Cassandra, HBase
Graph DBMSs(relation dense): Neo4j

1.2.3 Big Data (Velocity)

This is where we introduce streaming systems (Flink SQL).

DMS(traditional, query-driven / pull-based), DSS(modern, data-driven / push-based).

DSS Architecture:

input monitor
working storage
query repository
query processor
output buffer

In Stream Data Model, simplest tuple is <timestamp, payload>.

Windows are most important idea in streaming systems, specifically tumbling window/sliding window.

Query Models and Operators are monotonic or non-monotonic, having operators (selection/aggregation/multiplexing/demultiplexing/stream mining/joins/windowed queries).

Stream Query Processing focuses on load shedding(approx etc.) and out-of-order processing(time buffer etc.)

Fault Tolerance includes at-least-once/at-most-once/exactly-once, common recover mechanisms: replication/checkpointing/upstream backup.

2 Data Preparation

2.1 Data Acquisition

Database, API, web crawling/scraping, streaming/sensor feeds, data marketplaces/open data, crowdsourced data

2.2 Data Profiing

Mine out structural information. Provides metadata. (differs from data documentation, which is more detailed)

2.3 Data Selection

Has two parts: pre-integration(relevance/quality/availability/integration complexity/bias/representativeness) and post-integration(data lake/semantic understanding).