Nowadays there is a growing need to update and delete data in Hive. Let's take a quick look at how to update and delete data in ACID tables in Hive.
Minimum prerequisites to perform Hive CRUD using ACID operations are —
- Hive version 0.14 or later
- Table must be stored in the ORC file format and created with TBLPROPERTIES("transactional"="true")
- Table must be bucketed using CLUSTERED BY
- External tables cannot be made ACID tables since the changes on external tables are beyond the control of the compactor (HIVE-13175)
- BEGIN, COMMIT, and ROLLBACK are not yet supported; all language operations are auto-commit
- Reading/writing to an ACID table from a non-ACID session is not allowed. In other words, the Hive transaction manager must be set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager in order to work with ACID tables
- LOAD DATA… statement is not supported with transactional tables. (This was not properly enforced until HIVE-16732)
- Configuration parameters for supporting transactions (added in Hive 0.14; a quick way to verify them is shown right after this list):
◉ set hive.support.concurrency=true;
◉ set hive.enforce.bucketing=true;
◉ set hive.exec.dynamic.partition.mode=nonstrict;
◉ set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
◉ set hive.compactor.initiator.on=true;
◉ set hive.compactor.worker.threads=1;
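Before creating any ACID tables, it is worth confirming that these settings actually took effect in your session. Here is a minimal check, assuming a Hive release recent enough to support SHOW TRANSACTIONS —

-- SET with just a key echoes its current value back
SET hive.txn.manager;
SET hive.support.concurrency;
-- With DbTxnManager active, this lists open and aborted transactions
SHOW TRANSACTIONS;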
Here is our sample file sample.data —
99001,Adam,California
99002,Brain,New York
99003,Crane,Chicago
99004,David,California
99005,Emily,New York
99006,Frank,Chicago
99007,George,Chicago
99008,Hall,New York
99009,Ivan,California
99010,Jacob,New York
Follow the steps shown below to work with Hive transactional data —
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
-- Create temporary table 'emp_temp'
CREATE TEMPORARY TABLE emp_temp(id INT, name STRING, location STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- Load 'sample.data' into table 'emp_temp'
LOAD DATA LOCAL INPATH '/path/to/input/file/sample.data' INTO TABLE emp_temp;

-- Now create the main table 'emp_tbl', bucketed and stored as ORC
CREATE TABLE emp_tbl(id INT, name STRING, location STRING)
CLUSTERED BY (id) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC
TBLPROPERTIES("orc.compress.size"="1024", "transactional"="true");

-- Load data into table 'emp_tbl' from temporary table 'emp_temp'
-- (LOAD DATA is not supported on transactional tables, so use INSERT ... SELECT)
FROM emp_temp
INSERT INTO emp_tbl
SELECT id, name, location
ORDER BY id;
-- Check that the data was loaded successfully
SELECT * FROM emp_tbl;
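You can also confirm that the table really was created as transactional. DESCRIBE FORMATTED prints the table metadata, and its 'Table Parameters' section should include transactional=true —

-- Inspect table metadata; 'Table Parameters' should show transactional=true
DESCRIBE FORMATTED emp_tbl;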
☛ Run the UPDATE statement
UPDATE emp_tbl SET location = 'Delhi' WHERE location = 'New York';
-- Verify that the matching rows were updated
SELECT * FROM emp_tbl;
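Incidentally, from Hive 2.2 onwards ACID tables also support the MERGE statement, which expresses update-or-insert logic in a single operation. A rough sketch, assuming a hypothetical staging table emp_updates with the same columns as emp_tbl —

-- emp_updates is a hypothetical staging table holding incoming changes
MERGE INTO emp_tbl AS t
USING emp_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET location = s.location
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.location);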
☛ Run the DELETE statement
DELETE FROM emp_tbl WHERE name = 'Crane';
-- Verify that the matching row was deleted
SELECT * FROM emp_tbl;
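Under the hood, UPDATE and DELETE write delta files alongside the table's base files, and the compactor settings from earlier merge them in the background. If you do not want to wait for automatic compaction, you can trigger one manually and watch its progress —

-- Request a major compaction to merge delta files into the base files
ALTER TABLE emp_tbl COMPACT 'major';
-- Check the status of compaction requests
SHOW COMPACTIONS;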
This should give you the basic idea; you can build on these examples for your own use cases.
Hope you found this post useful. Feel free to leave your comments and suggestions below.