当前位置: 首页 > news >正文

Data Engineering - Management Preparation

Data Science

We define DS as Engineering + Analytics. Analytic part has alr demonstrated through another blog. This time we focus on Engineering. Where Engineering = Management (Structured & Unstructured) + Preparation. We focus more on OLAP.

The core idea in Data Engineering is Data Redundancy. (If want do OLAP, must duplicate into the component.)

1 Data Management

Key words for review: Workload, (Big) Data Characteristics(Volume, Variety, Velocity).

  • The main idea for data management is Data Storage. Generally, the first step is to define OLTP/OLAP, then next choose from 4V. Modernly, we tend to use many small engines(stateless DBMS) to construct a big logical, decoupled platform(lakehouse).

Data Platform/Framework: Warehouses, Lakehouses, etc. The difference between Warehouses and Lakehouses is if they are decoupled (we say an integrated OLAP DBMS = Warehouse, multiple engine components = Lakehouse).

  • Data Warehouse Architecture (sigmod2016,snowflake): Storage Layer -> Virtual Warehouse Layer -> Cloud Service Layer. Any service must be based on VW.

  • Data Lakehouse Architecture (cidr2021,databricks): Object Store Layer -> Metadata Layer -> Engine Layer. Any service can access data on their own directly.

First we learn something called DBMS. Technically, traditional DBMS are different from modern DBMS(we call them engines, eg. Spark SQL).

The core idea for Data Management is -> (Intentional) Data Redundancy (for both workloads).

Database Management is the first topic in this area (which includes structured/traditional relational db etc. and unstructured/modern vector db etc.). Note Database Management is firstly used for structured data (could be used for unstuctured data as well).

We use data-based instead of data-driven because the latter means automatic.

DBMS Functionality

  • Data Model

Schema is abstract, Instance is concrete. Database schema is a set of relation schemas. Relation (instance) is a set of tuples. Tuple = Row = Record. Attribute = Column = Field. We can say relation (abstract) is a table (concrete).

  • Storage Management

Manage how DBMS controls database. (Database System = Database + DBMS, technically we should say MySQL DMBS is designed for row-stored databases, furthermore, MySQL is merely a server layer, actually index and transaction/low-level is based on InnoDB engine inside MySQL.)

  • Metadata Management

Data and metadata should be managed samely.

  • Data Definition and Access

DBMS provides DDL/DML.

  • View Management

Global/Local data.

  • Data Independence

Logical–physical separation is part of this idea. (external + conceptual/logical + internal)

  • Data Integrity

Database consistency (data is correct).

  • Access Control and Privacy

Access control for different groups.

  • Database Maintenence

Query optimization.

Query Process/Optimization

Declarative query -> Parser & translator -> Internal algebraic expression -> Optimizer -> Execution plan -> Evaluation engine -> Query output.

Logical optimization is for the whole plan, Physical optimization (nest loop JOIN, sort merge JOIN, hash JOIN) is for a single operation. We use a very old formula for this course (now we use bigger RAMs) Cost = I/O cost + CPU cost.

We oftenly use three types of indices: B+, Hash, Bitmap. B+ is common, while hash can't be used to search an interval, bitmap is good for OLAP. Note these are only used in DBMS, database itself doesn't participate.

Transaction Processing

For a DBMS, we follow ACID rule -> Atomicity, Consistency, Isolation, Durability.

Most strict concurrency control: serializable (equivalent to a serial history). For conflicts only focus on at least one W on the same data.

Algorithms: pessimistic or optimistic(prefix), with lock or timestamp(suffix). Common methods are 2PL(lock)/MVCC(timestamp).

2PL: 2-phase lock, finish all reading then start writing.

MVCC: If some transaction committed after Ti started and created a new version of an item in Ti's read set, then Ti's snapshot is stale and Ti must abort.(read + write + validate + commit/abort)

1.1 Structured Data Management

1.1.1 Relational Data Model

Note a relation instance is a whole "table" tuples, not a single row.

We have 3 constraints: implicit constraints(tuple, attribute unique)/explicit constraints(primary key, foreign key)/semantic constraints(consistency).

1.1.2 Relational Algebra/SQL

Relational Data Language = Relational calculus + Relational algebra. We focus on algebra.

Relational Algebra forms SQL. (selection/projection/union/set difference/Cartesian product; intersection/join/division; outer join/rename/aggregate/assignment)

Note difference between Union(add row) and Join(add column). Understand division(pick all) and product(search all). Assignments logically creates new table, but rename is only for renaming.

SQL allow duplicate rows(unlike relational model). SQL has DML and DDL. (SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT/OFFSET)

Assertions are best for global database constraints across multiple tables or rows; triggers only happen dynamically when an operation is commited.

VIEWs are stored SQL programs.

1.1.3 Workloads

  • OLTP (row): MySQL, PostgreSQL, SQL Server

Minimizes the tuple reconstruction cost. Primary concern is transaction latency and throughput.

Commonly, we assume DBMS = OLTP (just by convenience). NoSQL are mainly all OLTP.

  • OLAP (column): Snowflake, ClickHouse, Amazon Redshift

Reduces I/O by reading only the required attributes. Suit for end-of-month work. This is oftenly for Data Warehouses. OLAP DBMS are mainly all SQL.

Vector Execution (a vector is a short column)

Run-length Encoding (long runs of repeated values)

Late Materialization (reconstructing tuples only when required late in the query plan)

  • HTAP (mixed, modern): SAP HANA, TiDB, SingleStore

Running analytics directly on fresh transactional data. Supports real-time. Introduces NewSQL.

MVCC is particularly important in HTAP systems.

(Very expensive, so can only do small, near OLAP's.)

1.2 Unstructured Data Management

Three main categories in OLAP:

Data Lake

Data Warehouse

Data Lakehouse

4V(or 3V) = (Volume + Variety + Velocity) + Veracity. Big data is approximately semi/unstructured data.

1.2.1 Big Data (Volume)

This is where we introduce Data Lakehouse.

Scale-up is used for single machine, scale-out means scaling the number of machines.

Data Center Architecture

Three sharing architectures: shared-memory, shared-disk, shared-nothing(modern).

Parallel Data Management

  • Data Partitioning Methods: round-robin partitioning(inefficient, simplest)/hash partitioning(skew)/range partitioning(skew)

  • Execution Models: interoperator parallelism(each operator parallel)/intraoperator parallelism(a single operator is done by multiple nodes)

  • Parallel Algorithms: parallel sort/parallel nested-loop join/parallel hash join(suits for shared nothing, because could be done locally)

  • Impediments to Parallelism: Machine numbers scaling-up doesn't result in linear promotion in performance.

Distributed Data Storage

HDFS has a single NameNode (metadata), multiple DataNode (data chunks).

Big Data Processing Frameworks

Here's where we introduce Distributed Frameworks. For example, Spark SQL + Spark Core + HDFS looks like a OLAP 'DBMS', but since they are not inside a real DBMS, and far more intelligent than a DBMS could do, we call it a data lakehouse.

  • Hadoop/MapReduce (traditional)

Start to use HDFS as low-level.

Three step: Map(record value) + Shuffle(group by keys from all mapper, most costly) + Reduce(aggregate values for each key).

Number of maps = splits. Number of mapper outputs = number of reducers. Note the same key will go into the same reducer: f(key) = hash(key) mod numOfReducer. (shuffle)

  • Spark (modern)

  • Cloud-native Systems (new)

Suits log-structured storage, immutable micro-partitions, tiered storage, pushdown / near-data processing, consensus-based replication, global clocks / TrueTime / HLC.

Cloud-native must be cloud-hosted.

For OLTP: disaggregated compute-storage(decouple storage and computation); disaggregated compute-log-storage(flush based on logs, speeds up write); disaggregated compute-buffer-storage(shared buffer/cache, speeds up read).
For OLAP: disaggregated compute-storage; disaggregated compute-memory-storage(add a memory module for intermediates).

Geographic Distribution

The only new idea is allocation (close data should be put near). For distributed systems, we should calc not only CPU, I/O, but communication cost as well.

Local serializability is not enough:

  • Centralized 2PL (simplest)

  • Distributed 2PL

  • Distributed Deadlock

  • 2PC

1.2.2 Big Data (Variety)

This is where we introduce NoSQL. (to solve 'one size fits all' problem)

Text Data

Traditional: inverted index -> tokenization -> stemming -> TF-IDF -> Lucene / Elasticsearch -> NLP extraction
Modern: embeddings -> semantic retrieval -> vector indexing / ANN

Image Data

Traditional: CBIR, SIFT / HOG.
Modern: CNN / ViT, Multimodal.

Video and Audio Data

The complexity comes from time structure.

NoSQL Systems

  • Key-value Stores(simplest): Redis, DynamoDB

  • Document Stores(JSON-like): MongoDB, Couchbase

  • Wide-column Stores(distributed write/sparse): Cassandra, HBase

  • Graph DBMSs(relation dense): Neo4j

1.2.3 Big Data (Velocity)

This is where we introduce streaming systems (Flink SQL).

DMS(traditional, query-driven / pull-based), DSS(modern, data-driven / push-based).

DSS Architecture:

  • input monitor
  • working storage
  • query repository
  • query processor
  • output buffer

In Stream Data Model, simplest tuple is <timestamp, payload>.

Windows are most important idea in streaming systems, specifically tumbling window/sliding window.

Query Models and Operators are monotonic or non-monotonic, having operators (selection/aggregation/multiplexing/demultiplexing/stream mining/joins/windowed queries).

Stream Query Processing focuses on load shedding(approx etc.) and out-of-order processing(time buffer etc.)

Fault Tolerance includes at-least-once/at-most-once/exactly-once, common recover mechanisms: replication/checkpointing/upstream backup.

2 Data Preparation

2.1 Data Acquisition

Database, API, web crawling/scraping, streaming/sensor feeds, data marketplaces/open data, crowdsourced data

2.2 Data Profiing

Mine out structural information. Provides metadata. (differs from data documentation, which is more detailed)

2.3 Data Selection

Has two parts: pre-integration(relevance/quality/availability/integration complexity/bias/representativeness) and post-integration(data lake/semantic understanding).

2.4 Data Integration

Integrate different sources of data (multimodal etc.).

2.5 Data Quality

This is where we introduce data cleaning.

http://www.jsqmd.com/news/601327/

相关文章:

  • 解锁论文写作新姿势:好写作AI,你的学术创作超级英雄
  • 避坑指南:为什么你的Matlab编译Ncorr总失败?盘点TDM-GCC版本、环境变量与Matlab兼容性的那些坑
  • 数字逻辑设计新范式:Logisim-Evolution全方位实践指南
  • OpenClaw敏感信息过滤:Qwen3-32B任务中的隐私保护机制
  • 打破输入法壁垒:如何让你的个人词库在全平台自由流动
  • Phi-4-mini-reasoning保姆级教学:PyTorch 2.8+Transformers加载全流程
  • 揭密携程任我行礼品卡回收变现的最佳平台与流程 - 团团收购物卡回收
  • Adobe-GenP:如何用通用补丁工具解锁Adobe CC全系列软件?
  • 猫抓扩展终极攻略:一键下载全网视频音频资源
  • 新手必看,利用快马生成mobaxterm中文设置图文指南
  • OpenClaw+百川2-13B-4bits:个人博客内容自动生成与发布方案
  • 如何突破网盘限速?3大核心方案+5种实战技巧
  • Godot游戏资源解包全攻略:3步轻松提取PCK文件内容
  • 告别重复劳动!Z-Image-ComfyUI程序化调用,一键生成多张图片
  • 环境迁移指南:将OpenClaw+百川2-13B-4bits从本地搬到云服务器
  • CH-SIMS数据集解析:细粒度多模态情感分析在中文场景下的实践与优化
  • FRCRN(damo/speech_frcrn_ans_cirm_16k)企业级部署:Prometheus监控指标接入
  • StreamCap:构建直播内容捕获的神经网络式生态系统
  • avalonia在国产芯片瑞芯微RK3588这么容易就跑起来了?
  • Unity火灾逃生模拟仿真:开启身临其境的演练之旅
  • bilibili-linux:Linux平台下的B站无缝体验解决方案
  • 终极指南:如何用RimSort快速解决环世界MOD排序难题
  • 出售加油卡的最佳选择:快速、安全又可靠 - 团团收购物卡回收
  • DeepSeek-R1推理模型入门:Ollama快速部署与实战应用解析
  • C#数字格式化实战:从基础保留小数到高级字符串处理
  • AI写论文新选择!4款AI论文生成工具,高效完成毕业论文创作!
  • AI 模型加载优化方案
  • Qwen-Image-Lightning升级体验:Lightning LoRA加速技术到底有多快?
  • 好靶场-csrf
  • 2025北京高考语文真题Word版下载(含答案解析)