hive sort by vs order by
select a.* from pokes a sort by a.foo desc;
http://blog.sina.com.cn/s/blog_6ff05a2c0101eaxf.html
In Hive there is not only the order by operation but also a sort by operation. Both perform sorting, but there are significant differences between them.
We will again use the order by example from last time to illustrate.
Test case
hive> select * from test09;
OK
100 tom
200 mary
300 kate
400 tim
Time taken: 0.061 seconds
hive> select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201105020924_0068, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0068
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0068
2011-05-03 05:39:21,389 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:39:23,410 Stage-1 map = 50%, reduce = 0%
2011-05-03 05:39:25,430 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:39:30,470 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:39:32,493 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0068
OK
100 tom
300 kate
200 mary
400 tim
Time taken: 17.783 seconds
The result looks much like order by, but sort by is not affected by the hive.mapred.mode parameter; it works no matter which mode hive.mapred.mode is set to.
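As a minimal sketch of that difference (assuming a Hive version where strict mode enforces the usual restriction on order by), in strict mode order by refuses to run without a limit clause, while sort by is unaffected:

hive> set hive.mapred.mode=strict;
-- rejected in strict mode: order by requires a limit clause
hive> select * from test09 order by id;
-- accepted: the limit bounds the single-reducer sort
hive> select * from test09 order by id limit 4;
-- accepted: sort by is not restricted by strict mode
hive> select * from test09 sort by id;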
From the line "Number of reduce tasks not specified. Defaulting to jobconf value of: 2" above, we can see that 2 reducers were launched for this query.
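As the job output itself hints, the reducer count can also be set explicitly before running the query (a sketch; the value 3 here is only an illustration):

hive> set mapred.reduce.tasks=3;
-- each of the 3 reducers now produces its own sorted output
hive> select * from test09 sort by id;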
What sort by actually guarantees is that the file produced by each reducer is sorted (as the result above shows, the overall output is not guaranteed to be in order); to obtain a total order, a single merge pass over the already-sorted files is enough. This is much better than order by, which runs with only a single reducer.
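Related to this, if you want to control which rows go to which reducer before sort by does its per-reducer sorting, you can combine it with distribute by (a sketch that is not part of the original test, reusing the same test09 table):

-- distribute by id decides which reducer each row is sent to,
-- and sort by id then orders the rows within each reducer
hive> select * from test09 distribute by id sort by id;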
Writing the output of the sort by query to files makes this much easier to see.
hive> insert overwrite local directory '/home/hjl/sunwg/qqq' select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201105020924_0069, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0069
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0069
2011-05-03 05:41:27,913 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:41:30,939 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:41:37,993 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:41:41,023 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0069
Copying data to local directory /home/hjl/sunwg/qqq
Copying data to local directory /home/hjl/sunwg/qqq
4 Rows loaded to /home/hjl/sunwg/qqq
OK
Time taken: 18.496 seconds
[hjl@sunwg src]$ ll /home/hjl/sunwg/qqq
total 8
-rwxrwxrwx 1 hjl hjl 17 May 3 05:41 attempt_201105020924_0069_r_000000_0
-rwxrwxrwx 1 hjl hjl 17 May 3 05:41 attempt_201105020924_0069_r_000001_0
Two files were produced; let's look at the contents of each one.
[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000000_0
100 tom
300 kate
[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000001_0
200 mary
400 tim
As you can see, the contents of each file are sorted internally.
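If you actually need a single, globally sorted output file, one option (a sketch building on the reducer-count setting mentioned earlier) is to force sort by through one reducer, which makes this particular write behave like order by:

hive> set mapred.reduce.tasks=1;
-- with a single reducer there is only one output file, so its internal order is the total order
hive> insert overwrite local directory '/home/hjl/sunwg/qqq' select * from test09 sort by id;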
Both order by and sort by can be used for sorting; which one to use depends on the situation. If the data volume is not too large, order by is fine; if the data set is very large, sort by is the better choice.
This article is reposted from http://www.oratea.net/?p=624