大数据开发之Hive篇19-Hive分区表详解

墨天轮 · · 1272 次点击 · · 开始浏览    
这是一个创建于 的文章,其中的信息可能已经有所发展或是发生改变。

备注:
Hive 版本 2.1.1

Table of Contents

一.Hive分区表概述

数据分区的概念以及存在很久了,通常使用分区来水平分散压力,将数据从物理上移到和使用最频繁的用户更近的地方,以及实现其目的。 hive中有分区表的概念,我们可以看到分区具重要性能优势,而且分区表还可以将数据以一种符合逻辑的方式进行组织,比如分层存储

Hive官方网站对Hive partition的介绍:
可以使用Partitioned BY子句创建分区表。一个表可以有一个或多个分区列,并且为分区列中的每个不同的值组合创建一个单独的数据目录。此外,可以使用按列聚集的方式对表或分区进行存储,并且可以通过按列排序的方式在存储区内对数据进行排序。这可以提高某些查询的性能。

如果在创建分区表时,出现这样的错误:“FAILED: error in semantic analysis: Column repeated in partitioning columns”,这意味着您试图将分区的列包含在表本身的数据中。您可能确实定义了列。但是,您创建的分区会生成一个可以查询的伪列,因此必须将表列重命名为其他东西(用户不应该查询的东西!)。

分区表分别有静态分区和动态分区

创建分区表语法:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

修改分区表语法:
通过在ALTER TABLE语句中使用PARTITION子句,可以添加、重命名、交换(移动)、删除或归档分区

-- 新增分区
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
 
partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

-- 重命名分区
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;

-- 删除分区
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec;

-- 删除分区(回收站不保留,直接删除)
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec PURGE;

-- 归档和还原归档分区表
ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;

二.静态分区

可以根据PARTITIONED BY创建分区表,一个表可以拥有一个或者多个分区,每个分区以文件夹的形式单独存在表文件夹的目录下。
分区是以字段的形式在表结构中存在,通过describe table命令可以查看到字段存在,但是该字段不存放实际的数据内容,仅仅是分区的表示。
分区建表分为2种,一种是单分区,也就是说在表文件夹目录下只有一级文件夹目录。另外一种是多分区,表文件夹下出现多文件夹嵌套模式。

2.1 单分区测试

代码:

-- 创建分区表
create table test_part(key int,value string) partitioned by (dt string);
-- 查看分区表
describe formatted test_part;
-- 往分区表录入数据
insert into test_part partition(dt = '2020-12-29') values (1,'abc');
insert into test_part partition(dt = '2020-12-30') values (2,'def');
insert into test_part partition(dt = '2020-12-31') values (3,'ghi');

-- 新增分区
alter table  test_part add partition(dt = '2021-01-01');
-- 重命名分区
alter table test_part partition(dt = '2021-01-01') rename to partition(dt = '2021-01-02');
-- 删除分区
alter table test_part drop partition(dt='2020-12-31');
-- 删除分区(回收站不保留,直接删除)
alter table test_part drop partition(dt='2020-12-30') purge;

测试记录:

hive> 
    > create table test_part(key int,value string) partitioned by (dt string);
OK
Time taken: 0.087 seconds
hive> describe formatted test_part;
OK
# col_name              data_type               comment             
                 
key                     int                                         
value                   string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
dt                      string                                      
                 
# Detailed Table Information             
Database:               test                     
OwnerType:              USER                     
Owner:                  root                     
CreateTime:             Tue Dec 29 15:59:02 CST 2020     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://nameservice1/user/hive/warehouse/test.db/test_part        
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0                   
        numPartitions           0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1609228742          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.193 seconds, Fetched: 38 row(s)
hive> 


hive> insert into test_part partition(dt = '2020-12-29') values (1,'abc');
Query ID = root_20201229160327_aef35dcf-1aaa-4b30-b47b-cbe9ea3aec3e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:03:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0003, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0003/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:03:35,297 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:03:42,522 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1609141291605_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part/dt=2020-12-29/.hive-staging_hive_2020-12-29_16-03-27_864_460539835119685808-1/-ext-10000
Loading data to table test.test_part partition (dt=2020-12-29)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.85 sec   HDFS Read: 4229 HDFS Write: 90 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 850 msec
OK
Time taken: 16.389 seconds
hive> 
    > insert into test_part partition(dt = '2020-12-30') values (2,'def');
Query ID = root_20201229160421_9da78dee-ff03-4d5a-9fef-b3808a69a3c6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:04:22 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0004, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0004/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:04:29,214 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:04:36,442 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.99 sec
MapReduce Total cumulative CPU time: 2 seconds 990 msec
Ended Job = job_1609141291605_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part/dt=2020-12-30/.hive-staging_hive_2020-12-29_16-04-21_921_4429247941950558015-1/-ext-10000
Loading data to table test.test_part partition (dt=2020-12-30)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.99 sec   HDFS Read: 4232 HDFS Write: 90 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 990 msec
OK
Time taken: 17.229 seconds
hive> 
    > insert into test_part partition(dt = '2020-12-31') values (3,'ghi');
Query ID = root_20201229160458_be37945d-a462-4ab1-b8fe-0180277c2399
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:04:58 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0005, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0005/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:05:06,866 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:05:13,062 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.03 sec
MapReduce Total cumulative CPU time: 3 seconds 30 msec
Ended Job = job_1609141291605_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part/dt=2020-12-31/.hive-staging_hive_2020-12-29_16-04-58_612_6006646373659507783-1/-ext-10000
Loading data to table test.test_part partition (dt=2020-12-31)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.03 sec   HDFS Read: 4232 HDFS Write: 90 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 30 msec
OK
Time taken: 16.151 seconds
hive> 
    > alter table  test_part add partition(dt = '2021-01-01');
OK
Time taken: 0.147 seconds
hive> 
    > alter table test_part partition(dt = '2021-01-01') rename to partition(dt = '2021-01-02');
OK
Time taken: 0.353 seconds
hive> 
    > alter table test_part drop partition(dt='2020-12-31');
Dropped the partition dt=2020-12-31
OK
Time taken: 0.151 seconds
hive> alter table test_part drop partition(dt='2020-12-30') purge;
Dropped the partition dt=2020-12-30
OK
Time taken: 0.135 seconds
hive> 

我们来查看下分区的存储
可以看到一份分区一个文件夹,一个分区下可能有0个或多个文件

[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part
Found 2 items
drwxrwxrwt   - root hive          0 2020-12-29 16:03 /user/hive/warehouse/test.db/test_part/dt=2020-12-29
drwxrwxrwt   - root hive          0 2020-12-29 16:09 /user/hive/warehouse/test.db/test_part/dt=2021-01-02

2.2 多分区测试

代码:

-- 创建分区表
create table test_part2(key int,value string) partitioned by (prod_name string,dt string);
-- 查看分区表
describe formatted test_part2;
-- 往分区表录入数据
insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-29') values (1,'abc');
insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-30') values (2,'def');
insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-31') values (3,'ghi');

-- 新增分区
alter table  test_part2 add partition(prod_name='PROD2',dt = '2020-12-29');
-- 重命名分区
alter table test_part2 partition(prod_name='PROD2',dt = '2020-12-29') rename to partition(prod_name='PROD3',dt = '2020-12-29');

测试记录:

hive> 
    > 
    > 
    > create table test_part2(key int,value string) partitioned by (prod_name string,dt string);
OK
Time taken: 0.09 seconds
hive> describe formatted test_part2;
OK
# col_name              data_type               comment             
                 
key                     int                                         
value                   string                                      
                 
# Partition Information          
# col_name              data_type               comment             
                 
prod_name               string                                      
dt                      string                                      
                 
# Detailed Table Information             
Database:               test                     
OwnerType:              USER                     
Owner:                  root                     
CreateTime:             Tue Dec 29 16:43:06 CST 2020     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://nameservice1/user/hive/warehouse/test.db/test_part2       
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0                   
        numPartitions           0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1609231386          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.074 seconds, Fetched: 39 row(s)
hive> insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-29') values (1,'abc');
Query ID = root_20201229164322_9ffab4bf-8e7b-449f-ac07-91f73016da13
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:43:22 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0006, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0006/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:43:29,435 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:43:35,624 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.86 sec
MapReduce Total cumulative CPU time: 3 seconds 860 msec
Ended Job = job_1609141291605_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29/.hive-staging_hive_2020-12-29_16-43-22_143_1775810973752071302-1/-ext-10000
Loading data to table test.test_part2 partition (prod_name=PROD1, dt=2020-12-29)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.86 sec   HDFS Read: 4319 HDFS Write: 107 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 860 msec
OK
Time taken: 16.222 seconds
hive> insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-30') values (2,'def');
Query ID = root_20201229164339_1a52188f-5fcb-4275-a26b-8a36db978218
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:43:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0007, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0007/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:43:47,977 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:43:55,193 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.61 sec
MapReduce Total cumulative CPU time: 3 seconds 610 msec
Ended Job = job_1609141291605_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-30/.hive-staging_hive_2020-12-29_16-43-39_784_2410125837572976640-1/-ext-10000
Loading data to table test.test_part2 partition (prod_name=PROD1, dt=2020-12-30)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.61 sec   HDFS Read: 4319 HDFS Write: 107 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 610 msec
OK
Time taken: 17.101 seconds
hive> insert into test_part2 partition(prod_name='PROD1',dt = '2020-12-31') values (3,'ghi');
Query ID = root_20201229164358_696aac78-0a49-46f0-bb8a-e408b22c48cb
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 16:43:58 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0008, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0008/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-29 16:44:06,382 Stage-1 map = 0%,  reduce = 0%
2020-12-29 16:44:12,570 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.46 sec
MapReduce Total cumulative CPU time: 3 seconds 460 msec
Ended Job = job_1609141291605_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-31/.hive-staging_hive_2020-12-29_16-43-58_167_5034454344356683760-1/-ext-10000
Loading data to table test.test_part2 partition (prod_name=PROD1, dt=2020-12-31)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.46 sec   HDFS Read: 4319 HDFS Write: 107 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 460 msec
OK
Time taken: 17.146 seconds
hive> alter table  test_part2 add partition(prod_name='PROD2',dt = '2020-12-29');
OK
Time taken: 0.114 seconds
hive> alter table test_part2 partition(prod_name='PROD2',dt = '2020-12-29') rename to partition(prod_name='PROD3',dt = '2020-12-29');
OK
Time taken: 0.309 seconds
hive> 

我们来查看下分区的存储
可以看到多级分区目录为多级,每一个prod_name一个文件夹,产品文件夹下面,每一个dt是一个目录。
多分区虽然会产生比较多的文件,如果数据量不大的情况下,谨慎使用。

[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part2
Found 3 items
drwxrwxrwt   - root hive          0 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1
drwxrwxrwt   - root hive          0 2020-12-29 16:44 /user/hive/warehouse/test.db/test_part2/prod_name=PROD2
drwxrwxrwt   - root hive          0 2020-12-29 16:44 /user/hive/warehouse/test.db/test_part2/prod_name=PROD3
[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part2/prod_name=PROD1
Found 3 items
drwxrwxrwt   - root hive          0 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29
drwxrwxrwt   - root hive          0 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-30
drwxrwxrwt   - root hive          0 2020-12-29 16:44 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-31
[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29
Found 1 items
-rwxrwxrwt   3 root hive          6 2020-12-29 16:43 /user/hive/warehouse/test.db/test_part2/prod_name=PROD1/dt=2020-12-29/000000_0
[root@hp1 ~]# 

三.动态分区

官方文档对静态分区和动态分区的简单解释:

Static Partition (SP) columns: in DML/DDL involving multiple partitioning columns, the columns whose values are known at COMPILE TIME (given by user).
Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.

静态分区(SP)列:在涉及多个分区列的DML/DDL中,这些列的值在编译时已知(由用户给出)。
Dynamic Partition (DP)列:在执行时才知道其值的列。

上面我们测试静态分区的时候,可以看到操作分区表的时候一定要指定分区,动态分区就解决了这个问题。

3.1 动态分区测试

开启动态分区,需要调整如下配置:

set hive.exec.dynamic.partition=true;   --开启动态分区 默认为true,开启
set hive.exec.dynamic.partition.mode=nonstrict; --指定动态分区模式,默认为strict,即必须指定至少一个分区为静态分区,nonstrict模式表示允许所有的分区字段都可以使用动态分区

代码:

 CREATE TABLE emp_dynamic_partition (
 empno int,
 ename string,
 job string,
 mgr int,
 hiredate string,
 salary double,
 comm double
 )
 PARTITIONED BY (deptno int);

insert into emp_dynamic_partition partition(deptno) select * from emp;

测试记录:

hive> 
    > 
    >  CREATE TABLE emp_dynamic_partition (
    >  empno int,
    >  ename string,
    >  job string,
    >  mgr int,
    >  hiredate string,
    >  salary double,
    >  comm double
    >  )
    >  PARTITIONED BY (deptno int);
OK
Time taken: 0.106 seconds
hive> 
    > insert into emp_dynamic_partition select * from emp;
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'emp_dynamic_partition'
hive> 
    > set hive.exec.dynamic.partition.mode=nonstrict; 
hive> 
    > 
    > insert into emp_dynamic_partition select * from emp;
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'emp_dynamic_partition'
hive> 
    > 
    > insert into emp_dynamic_partition partition(deptno) select * from emp;
Query ID = root_20201229170212_e550dbfe-6d2e-49b0-83c8-9bc7d88c3144
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 17:02:13 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0009, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0009/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0009
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-29 17:02:20,034 Stage-1 map = 0%,  reduce = 0%
2020-12-29 17:02:26,274 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.77 sec
MapReduce Total cumulative CPU time: 6 seconds 770 msec
Ended Job = job_1609141291605_0009
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 17:02:28 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0010, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0010/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0010
Hadoop job information for Stage-3: number of mappers: 3; number of reducers: 0
2020-12-29 17:02:39,398 Stage-3 map = 0%,  reduce = 0%
2020-12-29 17:02:44,560 Stage-3 map = 67%,  reduce = 0%, Cumulative CPU 3.5 sec
2020-12-29 17:02:48,675 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 5.11 sec
MapReduce Total cumulative CPU time: 5 seconds 110 msec
Ended Job = job_1609141291605_0010
Loading data to table test.emp_dynamic_partition partition (deptno=null)


         Time taken to load dynamic partitions: 0.18 seconds
         Time taken for adding to write entity : 0.001 seconds
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 6.77 sec   HDFS Read: 13606 HDFS Write: 1041 HDFS EC Read: 0 SUCCESS
Stage-Stage-3: Map: 3   Cumulative CPU: 5.11 sec   HDFS Read: 11072 HDFS Write: 635 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 880 msec
OK
Time taken: 37.464 seconds
hive> 

可以看到根据deptno自动创建了3个分区

[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/emp_dynamic_partition
Found 3 items
drwxrwxrwt   - root hive          0 2020-12-29 17:02 /user/hive/warehouse/test.db/emp_dynamic_partition/deptno=10
drwxrwxrwt   - root hive          0 2020-12-29 17:02 /user/hive/warehouse/test.db/emp_dynamic_partition/deptno=20
drwxrwxrwt   - 

3.2 动态分区和静态分区混合使用

动态分区和静态分区可以混合使用

假设此时我有一个订单表,订单有一个状态。
我需要做一个快照表,每天凌晨把所有订单的状态做一个快照,但是数据量比较大,除了按天分区外,我还想根据状态进行分区,此时我们可以将两种分区混合使用

代码:

create table test1(user_id int,prod_name string,apply_status string);
insert into test1 values (1,'Prod1','申请中');
insert into test1 values (2,'Prod1','审批中');
insert into test1 values (3,'Prod1','完成');
insert into test1 values (4,'Prod1','拒绝中');

create table test_mix_partition
(user_id    int,
 prod_name  string
)
partitioned by (dt string,apply_status string);

insert into test_mix_partition partition(dt='2020-12-29',apply_status) select user_id,prod_name,apply_status from test1;

测试记录

hive> 
    > 
    > create table test_mix_partition
    > (user_id    int,
    >  prod_name  string
    > )
    > partitioned by (dt string,apply_status string);
OK
Time taken: 0.069 seconds
hive> insert into test_mix_partition partition(dt='2020-12-29',apply_status) select user_id,prod_name,apply_status from test1;
Query ID = root_20201229172136_c5b5e88f-f2ad-4863-944e-c1dc89be7030
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/12/29 17:21:36 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0015, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0015/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0015
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-29 17:21:43,261 Stage-1 map = 0%,  reduce = 0%
2020-12-29 17:21:49,445 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.77 sec
MapReduce Total cumulative CPU time: 3 seconds 770 msec
Ended Job = job_1609141291605_0015
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://nameservice1/user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/.hive-staging_hive_2020-12-29_17-21-36_213_8707202021987000828-1/-ext-10000
Loading data to table test.test_mix_partition partition (dt=2020-12-29, apply_status=null)


         Time taken to load dynamic partitions: 0.175 seconds
         Time taken for adding to write entity : 0.001 seconds
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 3.77 sec   HDFS Read: 9390 HDFS Write: 423 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 770 msec
OK
Time taken: 14.95 seconds
hive> 


[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_mix_partition
Found 1 items
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29
[root@hp1 ~]# 
[root@hp1 ~]# hadoop fs -ls /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29
Found 4 items
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=完成
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=审批中
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=拒绝中
drwxrwxrwt   - root hive          0 2020-12-29 17:21 /user/hive/warehouse/test.db/test_mix_partition/dt=2020-12-29/apply_status=申请中
[root@hp1 ~]# 

四.分区的其它操作

4.1 恢复分区

语法:

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

msck repair table命令执行后Hive会检测如果HDFS目录下存在 但表的metastore中不存在的partition元信息,更新到metastore中。如果有一张表已经存放好几年了,用这个命令去执行的话 半天都反应不了,所以这个命令太暴力了,生产中不推荐使用。可以用Add partition来添加分区。

4.2 归档分区

语法:

ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;

归档分区可以减轻 name node的压力,当然,数据量非常大、文件非常多(千万级别)的时候,再考虑做归档。

4.3 交换分区

EXCHANGE PARTITION命令将一个分区从源表移动到目标表,并更改每个表的元数据。

约束:

  1. 目标表不能包含要交换的分区。
  2. 如果有索引,操作将失败。
  3. 交换分区不允许将事务表作为源或目标。或者,使用LOAD DATA或INSERT OVERWRITE命令在事务性表之间移动分区。
  4. 此命令要求源表名和目标表名具有相同的表模式。
  5. 如果模式不同,抛出以下异常:
    The tables have different schemas. Their partitions cannot be exchanged

语法:

ALTER TABLE <dest_table> EXCHANGE PARTITION (<[partial] partition spec>) WITH TABLE <src_table>

例1:

-- 创建两个分区表
CREATE TABLE T1(a string, b string) PARTITIONED BY (ds string);
CREATE TABLE T2(a string, b string) PARTITIONED BY (ds string);
ALTER TABLE T1 ADD PARTITION (ds='1');
 
-- 将t1表的分区 ds=‘1' 移动到t2表
ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;

测试记录:

hive> 
    > CREATE TABLE T1(a string, b string) PARTITIONED BY (ds string);
OK
Time taken: 0.081 seconds
hive> CREATE TABLE T2(a string, b string) PARTITIONED BY (ds string);
OK
Time taken: 0.085 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds='1');
OK
Time taken: 0.103 seconds
hive> ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;
OK
Time taken: 0.303 seconds
hive> 

例2

-- 创建2个多分区列的分区表
CREATE TABLE T1 (a string, b string) PARTITIONED BY (ds string, hr string);
CREATE TABLE T2 (a string, b string) PARTITIONED BY (ds string, hr string);
ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '00');
ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '01');
ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '03');
 
--  ds='1' 的三个分区同时移动到 t1分区表
ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;

测试记录:

hive> 
    > 
    > CREATE TABLE T1 (a string, b string) PARTITIONED BY (ds string, hr string);
OK
Time taken: 0.092 seconds
hive> CREATE TABLE T2 (a string, b string) PARTITIONED BY (ds string, hr string);
OK
Time taken: 0.073 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '00');
OK
Time taken: 0.106 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '01');
OK
Time taken: 0.122 seconds
hive> ALTER TABLE T1 ADD PARTITION (ds = '1', hr = '03');
OK
Time taken: 0.093 seconds
hive> ALTER TABLE T2 EXCHANGE PARTITION (ds='1') WITH TABLE T1;
OK
Time taken: 0.337 seconds
hive> 

参考

1.https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables

本文来自:墨天轮

感谢作者:墨天轮

查看原文:大数据开发之Hive篇19-Hive分区表详解

1272 次点击  
加入收藏 微博
暂无回复
添加一条新回复 (您需要 登录 后才能回复 没有账号 ?)
  • 请尽量让自己的回复能够对别人有帮助
  • 支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
  • 支持 @ 本站用户;支持表情(输入 : 提示),见 Emoji cheat sheet
  • 图片支持拖拽、截图粘贴等方式上传