Automatic (Dynamic) Partitioning With Hive
Partitioning in Hive is usually considered to be tedious task while loading the data. You have to manually segregate the data and have to tell Hive to which partition you want to to load data. This requires a lot of manual intervention and effort.
Further partitioning is usually done on the columns that does not exist in the data. What else if we want to partition on the column that exist in our data. I usually became clueless in this scenario.
Another problem is querying the one large HDFS file that contains historical data. If we can somehow divide the large file into small pieces we can get a faster turn around time while querying.
Consider a situation when we have one file that has two fields:ORDERID and ORDERDATE. Lets say each day we receive large number of orders. We want to view that file in the Hive but creating a one large table and querying the data is a heavy operation. So another way is to view this file as Hive Table partitioned by Date and all the partitions fills automatically depending upon value in the input file.
This problem can be solved by a two step process:
1) Set couple of properties in Hive
2)Create a external staging table "staging_order" and load the input files data to this table.
3) Create a main production external table "production_order" with the date as one of the partitioned columns.
4) Load the production table from the staging table so that data is distributed in partitions automatically.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Input File: (Order_ID,Date) | |
1,20140605 | |
2,20140605 | |
3,20140604 | |
4,20140604 | |
--Set following two properties for your Hive session: | |
SET hive.exec.dynamic.partition=true; | |
SET hive.exec.dynamic.partition.mode=nonstric; | |
--Create your staging table without Partition | |
hive> create external table tanveer.staging_orders (orderid int,order_date string) row format delimited fields terminated by ","; | |
OK | |
Time taken: 4.235 seconds | |
hive> | |
--load data to your Hive Table: | |
hive> load data local inpath '/hivetables/order_date' into table tanveer.staging_orders; | |
Copying data from file:/home/auto/tkhan3/hivetables/order_date | |
Copying file: file:/home/auto/tkhan3/hivetables/order_date/20140605 | |
Copying file: file:/home/auto/tkhan3/hivetables/order_date/20140604 | |
Loading data to table tanveer.staging_orders | |
OK | |
Time taken: 1.023 seconds | |
--verify data is loaded correctly in staging table | |
hive> select count(*) from tanveer.staging_orders; | |
OK | |
4 | |
--Seems we are good that data loaded into Hive. | |
--Create production table with the columns you want to parition upon. In this case we want to partition upon date column: | |
hive> create external table tanveer.production_orders (orderid string,order_date string) partitioned by (date_part string) row format delimited fields terminated by ','; | |
OK | |
hive> | |
--Load the Data into production table and insert from staging table such that data is paritioned in production table depending upon the date in record in staging table. | |
--Do make sure that the column on which you want to partition should come last in select statements. If there are series of column then there oerder in parition(col3,col4) should match in select statement. | |
hive> insert overwrite table tanveer.production_orders partition(date_part) select orderid, order_date,order_date as date_part from tanveer.staging_orders; | |
OK | |
--Verify data is correctly populated in partition: | |
hive> select * from tanveer.production_orders where date_part="20140605"; | |
OK | |
3 20140605 20140605 | |
4 20140605 20140605 | |
hive> select * from tanveer.production_orders where date_part="20140604"; | |
OK | |
1 20140604 20140604 | |
2 20140604 20140604 | |