HIVE 第四章 数据操作
数据操作篇
加载数据(会生成partition,如果不存在的话;local字段表示为是本机目录,如果不加,则表示为hdfs上的目录;overwrite关键字表示删除目标目录,当没有则保留,但会覆盖同名旧目录)一些限制条件:使用partition,则table必须是partition的;hive不会去验证你的data是服务schema的,但是hive会验证文件是否符合表定义,如sequencefile的table,file也必须是sequence files
load data local inpath '/home/hadoop/hive/data'
overwrite into table employees
partition(country='china',state='beijing');
静态插入数据:
插入数据(select * from employee 这样选择一个分区表,会将其partition的column也返回)
insert overwrite table employees
partition (country='US',state='dallas')
select name,salary,subordinate,deductions,address from employees se
where se.country='china' and se.state='beijing'
插入多条数据
FROM staged_employees se
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'IL')
SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';
动态插入并创建partition:
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
动态静态创建partition
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)
SELECT ..., se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
有一些参数控制,动态创建partition
hive.exec.dynamic.partition
false
Set to true to enable dynamic partitioning.
hive.exec.dynamic.partition.mode
strict
Set to nonstrict to enable all partitions to be determined
dynamically.
hive.exec.max.dynamic.partitions.pernode
100
The maximum number of dynamic partitions that can be created
by each mapper or reducer. Raises a fatal error if one
mapper or reducer attempts to create more than the threshold.
hive.exec.max.dynamic.partitions
+1000
The total number of dynamic partitions that can be created by
one statement with dynamic partitioning. Raises a fatal error
if the limit is exceeded.
hive.exec.max.created.files
100000
The maximum total number of files that can be created globally.
A Hadoop counter is used to track the number of files created.
Raises a fatal error if the limit is exceeded.
创建表,并加载数据
CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
导出数据
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
批量导出数据
FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'IL';