ACID transactions in Hive

Now a days there is growing need of updating/deleting of data in hive. Lets have quick look at how to update/delete data in ACID tables in hive.

Minimum pre-requisites to perform Hive CRUD using ACID operations are

  1. Hive version 0.14 and later
  2. Table created with file format must be in ORC file format with TBLPROPERTIES(“transactional”=”true”)
  3. Table must be CLUSTERED BY with Bucketing
  4. External tables cannot be made ACID tables since the changes on external tables are beyond the control of the compactor (HIVE-13175)
  5. BEGIN, COMMIT, and ROLLBACK are not yet supported, all language operations are auto-commit
  6. Reading/writing to an ACID table from a non-ACID session is not allowed. In other words, the Hive transaction manager must be set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager in order to work with ACID tables
  7. LOAD DATA… statement is not supported with transactional tables. (This was not properly enforced until HIVE-16732)
  8. Configuration parameters for supporting transactions(added in Hive 0.14):
        ◉ set hive.support.concurrency=true;
        ◉ set hive.enforce.bucketing=true;
        ◉ set hive.exec.dynamic.partition.mode=nonstrict;
        ◉ set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
        ◉ set hive.compactor.initiator.on=true;
        ◉ set hive.compactor.worker.threads=1;

Here is our sample file sample.data

99001,Adam,California
99002,Brain,New York
99003,Crane,Chicago
99004,David,California
99005,Emily,New York
99006,Frank,Chicago
99007,George,Chicago
99008,Hall,New York
99009,Ivan,California
99010,Jacob,New York

Follow the steps shown to work with hive transcational data —

1
2
3
4
5
6
SET hive.support.concurrency=TRUE;
SET hive.enforce.bucketing=TRUE;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=TRUE;
SET hive.compactor.worker.threads=1;
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- Create temporary table 'emp_temp'
CREATE TEMPORARY TABLE emp_temp(id INT, name STRING, location STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- Load 'sample.data' into table 'emp_temp'
LOAD DATA LOCAL INPATH '/path/to/input/file/sample.data' INTO TABLE emp_temp;

-- Now create main table 'emp_tbl' with file format 'ORC'
CREATE TABLE emp_tbl(id INT, name STRING, location STRING)
CLUSTERED BY (ID) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC
TBLPROPERTIES("orc.compress.size"="1024", "transactional"="true");

-- Load data into table 'emp_tbl' from temporary table 'emp_temp'
FROM emp_temp
INSERT INTO emp_tbl
SELECT id, name, location
ORDER BY 1;
30
31
-- Check data loaded to table successfully
SELECT * FROM emp_tbl;

☛ Run UPDATE statement

1
UPDATE emp_tbl SET location = 'Delhi' WHERE location = 'New York';
1
2
-- Check if row(s) in the table 'UPDATED' successfully
SELECT * FROM emp_tbl;

DELETE statement

1
DELETE FROM emp_tbl WHERE name = 'Crane';
1
2
-- Check if row(s) in the table 'DELETED' successfully
SELECT * FROM emp_tbl;

I have given you the idea, rest is your imagination.
Hope you like the post. You can post your comments/suggestions below.

2 thoughts on “ACID transactions in Hive”

  1. But after set hive.support.concurrency=true
    When I run any hive query eg. Show tables, select * from EMP…..the hive session will be stucked…..I wants 2 know why is stuck (halt)??

Leave a Reply

Your email address will not be published. Required fields are marked *