Hive SQL operations on HDFS are ultimately translated into MapReduce jobs. Below is a quick walkthrough of how to "import" an existing HDFS file into a Hive table; it is quite simple.
Preconditions and requirements:
1) The HDFS file is an LZO-compressed SequenceFile.
2) Sample SequenceFile record:
127.0.0.1 ae8186613401633040361974|1340163304040|0.1|hsf.qa.test|yuanfeng|test_07_01
The key is the host IP; the value is one log line in the format traceid|time|rpcid|appname|querykey|msg.
3) The Hive table is an external partitioned table.
Step 1: Create the Hive table
CREATE EXTERNAL TABLE biz_eagleeye (traceId STRING, time STRING, rpcId STRING,
appName STRING, queryKey STRING, msg STRING, kvMap STRING)
PARTITIONED BY(pt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS SEQUENCEFILE;
The EXTERNAL and PARTITIONED BY keywords mark the table as external and partitioned. STORED AS SEQUENCEFILE tells Hive the data source is a SequenceFile; for plain text, replace it with TEXTFILE.
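As a minimal sketch of the plain-text variant mentioned above, only the STORED AS clause changes (the table name biz_eagleeye_txt is hypothetical, chosen just to distinguish it from the SequenceFile table):

```sql
-- Hypothetical plain-text twin of the table above: same schema, TEXTFILE storage
CREATE EXTERNAL TABLE biz_eagleeye_txt (traceId STRING, time STRING, rpcId STRING,
  appName STRING, queryKey STRING, msg STRING, kvMap STRING)
PARTITIONED BY (pt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
```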
Step 2: Load the data from HDFS
ALTER TABLE biz_eagleeye ADD PARTITION (pt='2013-02-28')
LOCATION '/group/tlog/zhiyuan';
The LOCATION keyword supplies the HDFS path, and the partition value is given in the ADD PARTITION clause. One thing worth noting: when reading a SequenceFile, Hive by default discards the key (treating it as null) and splits the value on the declared delimiter (here '|'). If you need the key, or a more complex field-splitting scheme, you can specify a custom mapper and reducer:
mapred.mapper.class = org.apache.hadoop.hive.ql.exec.ExecMapper
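Once the partition is attached, the data can be queried immediately; no separate load step is needed for an external table. A hedged sketch of a sanity check, using the column names from the DDL above:

```sql
-- Confirm the partition was registered
SHOW PARTITIONS biz_eagleeye;

-- Sample a few rows from the newly added partition
SELECT traceId, appName, msg
FROM biz_eagleeye
WHERE pt = '2013-02-28'
LIMIT 10;
```

Because the table is external, DROP TABLE or dropping the partition removes only the metadata; the files under /group/tlog/zhiyuan stay on HDFS.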