当前位置: 首页 > news >正文

HiveQL - merge

Below are Postgres SQLs:

insert into kv (key, value) values ('cache:foo', 456) on conflict (key) do nothing;insert into kv (key, value) values ('cache:foo', 456) on conflict (key) do update set value = excluded.value;

 

Does Hive SQL support the 'on conflict' statement like Postgres?

 

The short answer is no. Hive SQL (HQL) does not support the ON CONFLICT syntax found in PostgreSQL.

Hive’s architecture is fundamentally different from a traditional transactional database like Postgres. While Postgres is designed for high-concurrency row-level operations, Hive is built for batch processing over large datasets in a distributed file system (like HDFS).

Why doesn't it work?

In Postgres, ON CONFLICT (often called "UPSERT") relies on unique constraints or indexes to detect collisions.

  • Hive does not enforce unique constraints. You can define a primary key for metadata purposes, but Hive won't stop you from inserting duplicate keys.

  • Storage Layer: Hive typically appends data to files. Checking for a "conflict" would require scanning the entire dataset for every single row inserted, which would be incredibly slow at scale.


How to achieve the same result in Hive

Since you can't use ON CONFLICT, you generally have two options depending on your Hive version and table configuration:

1. The MERGE Statement (Hive 2.2+)

If your table is an ACID-enabled table (using ORC or Parquet format with bucketing), you can use the MERGE statement, which is the standard way to perform upserts in Big Data environments.

Equivalent to ON CONFLICT DO UPDATE:

MERGE INTO kv AS target
USING (SELECT 'cache:foo' as key, 456 as value) AS source
ON target.key = source.key
WHEN MATCHED THENUPDATE SET value = source.value
WHEN NOT MATCHED THENINSERT VALUES (source.key, source.value);

2. The "Overwrite" Pattern (Standard Hive)

If you aren't using ACID tables, the common practice is to overwrite the entire table or partition using a LEFT JOIN to filter out old data.

Equivalent to ON CONFLICT DO NOTHING:

INSERT OVERWRITE TABLE kv
SELECT s.key, s.value
FROM staging_table s
LEFT JOIN kv t ON s.key = t.key
WHERE t.key IS NULL; -- Only keep rows that don't already exist

Summary Comparison

Feature PostgreSQL Hive SQL
UPSERT Syntax ON CONFLICT MERGE INTO (ACID only)
Constraint Enforcement Immediate & Strict Not Enforced (Informational only)
Primary Use Case OLTP (Real-time updates) OLAP (Batch processing)