Common Hive Commands
Hive documentation:
https://cwiki.apache.org/confluence/display/Hive/Home
Enable local mode
set hive.exec.mode.local.auto=true;
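Whether a query actually runs locally also depends on its input size. If you need to tune the cutoffs, the usual knobs are the two properties below; the values shown are the common defaults and may differ by Hive version:
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.input.files.max=4;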
DDL Operations
Create a table
hive> CREATE TABLE pokes (foo INT, bar STRING);
Create a table with a partition column ds
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
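Note that ds is a partition column, not an index: its values are not stored in the data files, yet Hive reports it alongside the regular columns and you can filter on it in queries. A quick way to inspect the layout:
hive> DESCRIBE invites;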
Show all tables
hive> SHOW TABLES;
Show tables matching a regular expression
hive> SHOW TABLES '.*s';
Add a column to a table
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
Add a column with a comment
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
Rename a table
hive> ALTER TABLE events RENAME TO 3koobecaf;
Drop a table
hive> DROP TABLE pokes;
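Hive has no DROP COLUMN statement; for tables using a native SerDe, the usual workaround is to restate only the columns you want to keep. A sketch, assuming pokes should lose the new_col added above:
hive> ALTER TABLE pokes REPLACE COLUMNS (foo INT, bar STRING);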
DML Operations
Load data from a local file into a table
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Load local data into a given partition
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
Load data from DFS into a given partition
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory into the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
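In contrast, LOAD DATA LOCAL copies the file, so the local original stays in place. To see the HDFS move for yourself, list the source directory before and after the load; the Hive CLI can run HDFS commands in-session:
hive> dfs -ls /user/myname/;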
SQL Operations
Query with a condition
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
Write query output to an HDFS directory
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
Write query output to a local directory
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Insert selected columns into a table, a local directory, or an HDFS directory
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
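SUM above can be swapped for any other aggregate; AVG, MIN and MAX follow the same pattern. A small grouped example reusing the invites table from this section:
hive> SELECT a.bar, AVG(a.foo), MIN(a.foo), MAX(a.foo) FROM invites a GROUP BY a.bar;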
Insert aggregated results from one table into another (the two forms below are equivalent)
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
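Outer joins use the same shape; for example, keeping every pokes row even when invites has no match:
hive> FROM pokes t1 LEFT OUTER JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;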
Multi-table insert: scan one source table once and feed several destinations
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION (ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
Streaming: pipe rows through an external script
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).
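A reduce-side sketch in the spirit of the Hive Tutorial: MAP ... CLUSTER BY groups the mapped rows before they reach the reducer command, and REDUCE streams each group through a second script. Here /bin/cat stands in for real mapper and reducer scripts:
FROM (
  FROM invites a
  MAP a.foo, a.bar
  USING '/bin/cat'
  AS (oof, rab)
  CLUSTER BY oof
) mapped
INSERT OVERWRITE TABLE events
REDUCE mapped.oof, mapped.rab
USING '/bin/cat'
AS (key, value);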
A Complete Example
Create a table
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Download the sample data file and unpack it
wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
tar xvzf ml-data.tar__0.gz
Load the data into the table
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
Count the total number of rows
SELECT COUNT(1) FROM u_data;
Now for some more complex data analysis.
Create a file weekday_mapper.py that maps each record's unix timestamp to a weekday:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # derive the weekday (1-7) from the unix timestamp
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
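Before wiring the script into Hive, a quick local sanity check (assuming the dataset was unpacked to ml-data/ as above):
head ml-data/u.data | python weekday_mapper.py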
Use the mapper script
Create the target table, again with tab-delimited fields
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Register the Python script as a job resource
add FILE weekday_mapper.py;
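add FILE ships the script to every node that executes the query; the currently registered resources can be checked with:
list FILES;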
Transform the data with the script, replacing the timestamp with a weekday
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;
Processing Apache Weblog Data
Parse each log line with a regular expression (via RegexSerDe) and load the captured fields into a table.
add JAR ../build/contrib/hive_contrib.jar;
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
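With the table in place, loading and querying works like any other text table. A sketch, where access_log is a hypothetical local file in Apache combined log format:
LOAD DATA LOCAL INPATH 'access_log' INTO TABLE apachelog;
SELECT host, status, request FROM apachelog LIMIT 10;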