Common Hive Commands

Hive documentation

https://cwiki.apache.org/confluence/display/Hive/Home

Enable local mode

set hive.exec.mode.local.auto=true;
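Whether a query actually runs locally also depends on how much input it reads. A minimal sketch of the related thresholds (the property names are standard Hive settings; the values shown here are assumptions to tune for your setup):

-- run locally only if the total input size is below this many bytes (assumed value)
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- run locally only if the job reads at most this many files (assumed value)
set hive.exec.mode.local.auto.input.files.max=4;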

DDL Operations

Create a table

hive> CREATE TABLE pokes (foo INT, bar STRING);

Create a table with a partition column ds

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Show all tables

hive> SHOW TABLES;

Show tables matching a regular expression

hive> SHOW TABLES '.*s';

Add a column to a table

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);

Add a column with a column comment

hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');

Rename a table

hive> ALTER TABLE events RENAME TO 3koobecaf;

Drop a table

hive> DROP TABLE pokes;
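Note that DROP TABLE removes the whole table. Hive has no single "drop column" statement; a column is usually removed by restating only the columns you want to keep with REPLACE COLUMNS. A sketch assuming the pokes schema shown above (foo, bar):

hive> ALTER TABLE pokes REPLACE COLUMNS (foo INT, bar STRING);

REPLACE COLUMNS rewrites only the table metadata; the underlying data files are left untouched.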

DML Operations

Load data from a local file into a table

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Load local data into a given partition

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

Load HDFS data into a given partition

hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory into the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
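To see this effect, you can list the table's storage directory from within the Hive CLI. The path below assumes the default warehouse location and is only illustrative:

hive> dfs -ls /user/hive/warehouse/invites/ds=2008-08-15/;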

SQL Operations

Query by condition

hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';

Write query results to an HDFS directory

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';

Write query results to a local directory

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

More examples of inserting query results into tables, HDFS directories, and local directories

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;

hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;

hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';

hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Insert aggregated results from one table into another

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;

hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

JOIN

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
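Other join types follow the same pattern. For example, a sketch of a LEFT OUTER JOIN that keeps every row of pokes even when invites has no matching bar:

hive> FROM pokes t1 LEFT OUTER JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;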

Multi-insert: write one source table into several destinations

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION (ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

Stream data through an external script with TRANSFORM

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).
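A minimal sketch of reduce-side streaming, adapted from the map-side example above: the inner query transforms and clusters the rows, and the outer TRANSFORM then runs on the reduce side. Here '/bin/cat' simply passes rows through, and '/tmp/reduce_out' is an illustrative output path, not one used elsewhere in this document.

FROM (
  FROM invites a
  SELECT TRANSFORM(a.foo, a.bar) USING '/bin/cat' AS (oof, rab)
  CLUSTER BY oof
) mapped
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reduce_out'
  SELECT TRANSFORM(mapped.oof, mapped.rab) USING '/bin/cat' AS (foo, bar);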

A practical example

Create a table

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Download the sample data file and unpack it

wget http://www.grouplens.org/system/files/ml-data.tar__0.gz

tar xvzf ml-data.tar__0.gz

Load the data into the table

LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;

Count the total number of rows

SELECT COUNT(1) FROM u_data;

Now do some more complex data analysis

Create a file weekday_mapper.py that maps each record's unix timestamp to a day of the week

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # derive the day of the week from the unix timestamp
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))

Use the mapper script

Create the destination table; fields in each row are split on the tab delimiter

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Register the Python script with Hive

add FILE weekday_mapper.py;

Transform the data, replacing the timestamp with the weekday

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;

Processing Apache web log data

Parse the raw web log with a regular expression (RegexSerDe), mapping each capture group to a table column

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
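Once the table exists, the log can be loaded and queried like any other table. A short sketch, where 'access_log' is a placeholder for your actual log file path:

-- 'access_log' is a placeholder path
LOAD DATA LOCAL INPATH 'access_log' INTO TABLE apachelog;

-- number of requests per HTTP status code
SELECT status, COUNT(1)
FROM apachelog
GROUP BY status;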
