Hive手写SQL案例

taisenki

2019-12-14

1-请详细描述将一个有结构的文本文件student.txt导入到一个hive表中的步骤，及其关键字

假设student.txt 有以下几列：id,name,gender三列
1-创建数据库 create database student_info;
2-创建hive表 student

create external table student_info.student(
id string comment ‘学生id‘,
name string comment ‘学生姓名‘,
gender string comment ‘学生性别‘
) comment "学生信息表"
row format delimited fields terminated by ‘\t‘
line terminated by ‘\n‘
stored as textfile
location "/user/root/student";

3-加载数据

load data local inpath ‘/root/student.txt‘ into table student_info.student  location "/user/root/student" ;

4- 进入hive-cli，查看相应的表结构
select * from student_info.student limit 10；

划重点：要回手写代码

2-利用HQL实现以下功能

2-1-创建表

创建员工基本信息表(EmployeeInfo)，字段包括(员工 ID，员工姓名，员工身份证号，性别，年龄，所属部门，岗位，入职公司时间，离职公司时间)，分区字段为入职公司时间，其行分隔符为”\n “，字段分隔符为”\t “。其中所属部门包括行政部、财务部、研发部、教学部，其对应岗位包括行政经理、行政专员、财务经理、财务专员、研发工程师、测试工程师、实施工程师、讲师、助教、班主任等，时间类型值如：2018-05-10 11:00:00
创建员工收入表(IncomeInfo)，字段包括(员工 ID，员工姓名，收入金额，收入所属
月份，收入类型，收入薪水的时间)，分区字段为发放薪水的时间，其中收入类型包括薪资、奖金、公司福利、罚款四种情况 ; 时间类型值如：2018-05-10 11:00:00。

注意：时间类型是2018-05-10 11:00:00，需要对字段进行处理

创建员工基本信息表

create external table test.employee_info(
id string comment ‘员工id‘,
name string comment ‘员工姓名‘,
indentity_card string comment ‘身份证号‘,
gender string comment ‘性别‘,
department string comment ‘所属部门‘,
post string comment ‘岗位‘,
hire_date string comment ‘入职时间‘,
departure_date string comment ‘离职时间‘
) comment "员工基本信息表"
partitioned by (day string comment "员工入职时间")
row format delimited fields terminated by ‘\t‘
lines terminated by  ‘\n‘
stored as textfile 
location ‘/user/root/employee‘;

创建员工收入表

create external table test.income_info(
id string comment ‘员工id‘,
name string comment ‘员工姓名‘,
income_data string comment ‘收入‘,
income_month string comment ‘收入所属月份‘,
income_type string comment ‘收入类型‘,
income_datetime string comment ‘收入薪水时间‘
) comment ‘员工收入表‘
partitioned by (day string comment "员工发放薪水时间")
row format delimited fields terminated by ‘\t‘
lines terminated by  ‘\n‘
stored as textfile 
location ‘/user/root/income‘;

2-2用 HQL 实现，求公司每年的员工费用总支出各是多少，并按年份降序排列?

重点对时间类型 2018-05-10 11:00:00 进行内置函数处理
需要读取income_info全量表，按照分区时间进行聚合，因为收入类型里面有罚款一项，所以需要在员工发放的钱中扣除罚款的钱。
不采用join、对数据一次遍历输出结果，
对于大数据量的情况下，要考虑对数据进行一次遍历求出结果

select 
    income_year,(income_data-(nvl(penalty_data,0))) as company_cost
from
(
    -- 统计员工收入金额和罚款金额，输出 2019 500 10
    select 
        income_year,
        sum(case when income_type!=‘罚款‘ then data_total else 0 end) as income_data,
        sum(case when income_type=‘罚款‘ then data_total else 0 end) as penalty_data
    from
    (
    -- 按照年份、收入类型求收入金额
    select 
        year(to_date(income_datetime)) as income_year,
        income_type,
        sum(income_data) as data_total
    from
        test.income_info
    group by 
        year(to_date(income_datetime)) ,income_type
    ) tmp_a
    group by  tmp_a.income_year
) as  temp
order by income_year desc;

2-3用 HQL 实现，求各部门每年的员工费用总支出各是多少，并按年份降序，按部门的支出升序排列？

保证对数据的一次遍历

--根据id关联得出department,和消费类型
select 
    income_year,department,
    (sum(case when income_type!=‘罚款‘ then income_data else 0 end) - sum(case when income_type=‘罚款‘ then income_data else 0 end) ) as department_cost
from
(
    -- 先对员工进行薪资类别的聚合统计
    select 
        id,year(to_date(income_datetime)) as income_year,income_type,sum(income_data) as income_data
    from 
        test.income_info
    group by 
    year(to_date(income_datetime)),id,income_type
) temp_a
inner join
    test.employee_info b
on
    temp_a.id=b.id
group by
    department,income_year
order by income_year desc , department_cost asc;

2-4用 HQL 实现，求各部门历史所有员工费用总支出各是多少，按总支出多少排名降序，遇到值相等情况，不留空位。

根据2-3中的中间结果进行修改
注意历史上所有的数据

select department,department_cost,dense_rank() over(order by department_cost desc) as cost_rank
from
(
--根据id关联得出department,和消费类型
select 
    department,
    (sum(case when income_type!=‘罚款‘ then income_data else 0 end) - sum(case when income_type=‘罚款‘ then income_data else 0 end) ) as department_cost
from
(
    -- 先对员工进行薪资类别的聚合统计
    select 
        id,income_type,sum(income_data) as income_data
    from 
        test.income_info
    group by 
    id,income_type
) temp_a
inner join
    test.employee_info b
on
    temp_a.id=b.id
group by
    department
) tmp_c ;

2-5 用 HQL 实现，创建并生成员工薪资收入动态变化表，即员工 ID，员工姓名，员工本月薪资，本月薪资发放时间，员工上月薪资，上月薪资发放时间。分区字段为本月薪资发放时间。

感觉应该使用动态分区插入的特性？-但是不知道该怎么写
先创建表，再采用insert into table **** select ***
要考虑到离职和入职的员工，这一点需要考虑到，full join
两张表进行full join，过滤day is null
需要concat year month to_date内置函数处理
这个题需要考虑的比较多

create external table test.income_dynamic(
id string comment ‘员工id‘,
name string comment ‘员工姓名‘,
income_data_current string comment ‘本月收入‘,
income_datetime_current string comment 本月‘收入薪水时间‘,
income_data_last   string comment ‘上月收入‘,
income_datetime_last string comment ‘上月收入薪水时间‘,
) comment ‘员工收入动态表‘
partitioned by (day string comment "员工本月发放薪水时间")
row format delimited fields terminated by ‘\t‘
lines terminated by  ‘\n‘
stored as textfile 
location ‘/user/root/income‘;
-- ------------------------------------------------------------------------------
-- 动态分区插入
-- 插入语句
-- 采用full join
insert into table test.income_dynamic partition(day)
select 
    (case when id_a is not null then id_a else id_b end ) as id,
    (case when name_a is not null then name_a else name_b end )  as name ,
    income_data,income_datetime,income_data_b,income_datetime_b,day
from
    (
    -- 选出表中所有的数据
    select
        id as id_a,name as name_a,income_data,income_datetime,day,concat(year(to_date(day)),month(to_date(day))) as day_flag
    from 
        test.income_info
    where 
        income_type=‘薪资‘ ) tmp_a
full outer join
    (
    -- 将表中的收到薪水的日期整体加一个月
    select
        id as id_b,name as name_b,income_data as income_data_b,income_datetime as  income_datetime_b,concat(year(add_months(to_date(day),1)),month(add_months(to_date(day),1))) as   month_flag
    from 
        test.income_info
    where 
        income_type=‘薪资‘
    ) tmp_b
    on 
        tmp_a.day_flag=tmp_b.month_flag
    and 
        tmp_a.id_a=tmp_b.id_b
where day is not null
;

2-6 用 HQL 实现，薪资涨幅方面，2018 年 5 月份谁的工资涨的最多，谁的涨幅最大？

再2-5的基础上做比较简单，仅仅利用select部分即可；或者是再2-5的基础上做就行

Hive行列转换

１、问题
hive如何将
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6
变为：
a       b       1,2,3
c       d       4,5,6
-------------------------------------------------------------------------------------------
２、数据
test.txt
a       b       1 
a       b       2 
a       b       3 
c       d       4 
c       d       5 
c       d       6
------------------------------------------------------------------------------------------- 
３、答案
1.建表
drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by ‘\t‘
stored as textfile;
-- 加载数据
load data local inpath ‘/home/jiangzl/shell/test.txt‘ into table tmp_jiangzl_test;
2.处理
select col1,col2,concat_ws(‘,‘,collect_set(col3)) 
from tmp_jiangzl_test  
group by col1,col2;
---------------------------------------------------------------------------------------
collect_set/concat_ws语法参考链接：https://blog.csdn.net/waiwai3/article/details/79071544
https://blog.csdn.net/yeweiouyang/article/details/41286469   [Hive]用concat_w实现将多行记录合并成一行
---------------------------------------------------------------------------------------
二、列转行
１、问题
hive如何将
a       b       1,2,3
c       d       4,5,6
变为：
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6
---------------------------------------------------------------------------------------------
2、答案
1.建表

drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by ‘\t‘
stored as textfile;
处理：
select col1, col2, col5
from tmp_jiangzl_test a 
lateral  view explode(split(col3,‘,‘)) b AS col5;
---------------------------------------------------------------------------------------
lateral  view 语法参考链接：
https://blog.csdn.net/clerk0324/article/details/58600284

Hive实现wordcount

1.创建数据库
create database wordcount;
2.创建外部表
create external table word_data(line string) row format delimited fields terminated by ‘,‘ location ‘/home/hadoop/worddata‘;
3.映射数据表
load data inpath ‘/home/hadoop/worddata‘ into table word_data;
4.这里假设我们的数据存放在hadoop下，路径为：/home/hadoop/worddata，里面主要是一些单词文件，内容大概为：
hello man
what are you doing now
my running
hello
kevin
hi man
执行了上述hql就会创建一张表src_wordcount，内容是这些文件的每行数据，每行数据存在字段line中，select * from word_data;就可以看到这些数据

5.根据MapReduce的规则，我们需要进行拆分，把每行数据拆分成单词，这里需要用到一个hive的内置表生成函数（UDTF）：explode(array)，参数是array，其实就是行变多列：

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;

6.查看words表内容
OK
hello
man
what
are
you
doing
now
my
running
hello
kevin
hi
man
split是拆分函数，跟java的split功能一样，这里是按照空格拆分，所以执行完hql语句，words表里面就全部保存的单个单词
7.group by统计单词
    select word, count(*) from wordcount.words group by word;
wordcount.words 库名称.表名称，group by word这个word是create table words(word string) 命令创建的word string

结果：
are     1
doing   1
hello   2
hi      1
kevin   1
man     2
my      1
now     1
running 1
what    1
you     1

Hive取TopN

rank() over()
dense_rank() over()
row_number() over()

求取指定状态下的订单id

给一张订单表，统计只购买过面粉的用户；（重点在于仅仅购买过面粉的客户）
eg：order:order_id,buyer_id,order_time.....

在保证一次遍历的情况下,重点是O(1)复杂度

select buyer_id
from
(
select buyer_id,sum(case when order_id=‘面粉‘ then 0 else 1 end) as flag
from order
) as tmp
where flag=0;

微博体系中互粉的有多少组

在微博粉丝表中，互相关注的人有多少组，例如：A-->B;B-->A；A和B互粉，称为一组。
表结构：id,keep_id,time.... (id,keep_id可作为联合主键)
借助Hive进行实现

select count(*)/2 as weibo_relation_number
from
(
  (select concat(id,keep_id) as flag from weibo_relation)
  union all  --全部合并到一起，不能提前去重
  (select concat(keep_id,id) as flag from weibo_relation)
) as tmp
having count(flag) =2
group by flag;

购买了香蕉的人买了多少东西

这个是一个很经典的问题，购买了香蕉的人买了多少东西
数据还是延用上一个问题的数据和表结构，即理解为关注C的人总共关注了多少人
仔细理解是需要对关注的人进行去重统计

select count(distinct keep_id) as total_keep_id
from weibo_relation
where id
  in
(select id from weibo_relation where keep_id=‘c‘)

hive string

taisenki

0 关注 0 粉丝 0 动态

关注关注

hive函数之~日期函数

语法: datediff返回值: int说明: 返回结束日期减去开始日期的天数。

tomson 2020-07-05

hive函数之~字符串函数

int start, int len)返回值: string说明：返回字符串A从start位置开始，长度为len的字符串。注意，在有些情况下要使用转义字符,类似oracle中的regexp_replace函数。index)返回值: string说明：将字符

zlsdmx 2020-07-05

Hive使用

大量的hive查询任务，如果用交互式shell来进行输入的话，显然效率及其低下，因此，生产中更多的是使用脚本化运行机制：。hive -e "insert into table t_dest select * from t_src;".

tugangkai 2020-07-04

3（Hive）

Hive 运行时，元数据存储在关系型数据库里面。用户接口主要有三个：CLI，Client 和 WUI。其中最常用的是CLI，Cli启动的时候，会同时启动一个Hive副本。Client是Hive的客户端，用户连接至Hive Server。Hive中的元数据包

archive 2020-07-30

Hive函数大全-完整版

现在虽然有很多SQL ON Hadoop的解决方案，像Spark SQL、Impala、Presto等等，但就目前来看，在基于Hadoop的大数据分析平台、数据仓库中，Hive仍然是不可替代的角色。尽管它的相应延迟大，尽管它启动MapReduce的时间相当

成长之路 2020-07-28

hdfs、hive、hbase的搭建总结

-- 完全分布式文件系统的名称：schema ip port -->. -- 分布式文件系统的其他路径的所依赖的一个基础路径，完全分布式不能使用默认值，因为临路径不安全，linux系统在重启时，可能会删除此目录下的内容-->. --

eternityzzy 2020-07-19

hive函数之~hive当中的lateral view 与 explode

lateral view用于和split、explode等UDTF一起使用的，能将一行数据拆分成多行数据，在此基础上可以对拆分的数据进行聚合，lateral view首先为原始表的每行调用UDTF，UDTF会把一行拆分成一行或者多行，lateral vie

taisenki 2020-07-05

hive函数之~窗口函数与分析函数

)row format delimited fields terminated by ‘,‘;第一个参数为列名，第二个参数为往上第n行，第三个参数为默认值。当ORDER BY和窗口从句都缺失, 窗口规范默认是 ROW BETWEEN UNBOUNDED P

tugangkai 2020-07-05

hive函数之~reflect函数

reflect函数可以支持在sql中调用java中的自带函数，秒杀一切udf函数。create table test_udf row format delimited fields terminated by ‘,‘;hive > create ta

SignalDu 2020-07-05

hive函数之~条件函数

‘tom‘ when 2=2 then ‘mary‘ else ‘tim‘ end from tableName;

genshengxiao 2020-07-05

hive函数之~关系运算

select 1 from tableName where 1 <> 2;hive> select 1 from tableName where 1 < = 1;select 1 from tableName where 2 >

tomson 2020-07-05

Hive的安装与启动

<description>JDBC connect string for a JDBC metastore</description>. <description>Driver class name for a JDBC

tomson 2020-07-05

Hive llap服务安装说明及测试（二）

因为Apache Slider 已经不维护了，下面介绍不使用 Slider 的 LLAP使用方式。

xieting 2020-06-28

Hive学习之路（二）Hive安装

bin binary-package-licenses conf examples hcatalog jdbc lib LICENSE NOTICE RELEASE_NOTES.txt scripts. -- 如果 mysql 和 hi

Zhangdragonfly 2020-06-28

Hadoop

Zookeeper：用于 Hadoop 的分布式协调服务。Hadoop 的许多组件依赖于 Zookeeper，它运行在计算机集群中，用于管理 Hadoop 集群。像 Pig 一样，Hive 作为一个抽象层工具，吸引了很多熟悉 SQL 而不是 Java 编程

genshengxiao 2020-06-26

（一）hive远程模式搭建

# scp -r /opt/soft/mysql-5.7.18-1.el7.x86_64.rpm-bundle.tar :/usr/ 发安装包。# rpm -qa|grep mariadb 检查：这是离线包安装方式所以要检查并且卸载mariadb. #

成长之路 2020-06-26

Hive学习(二)【数据类型、类型转换】

可以指定字符集。可以使用单引号或者双引号。和c语言中的struct类似，都可以通过“点”符号访问元素内容。MAP是一组键-值对元组集合，使用数组表示法可以访问数据。数组是一组具有相同类型和名称的变量的集合。例如CAST将把字符串‘1‘ 转换成整数1；如果强

tomson 2020-06-26

Hive1.2.2（一）

本课堂主要围绕hive的基础知识点进行讲解。主要包括以下几个方面。Hive是基于Hadoop的一个数据仓库工具，可以将结构化的数据文件映射为一张表，并提供类SQL查询功能。对于Hive来说默认就是mapreduce任务。与操作本地文件系统类似，这里需要使用

蜗牛之窝 2020-06-26

hive开窗开窗函数进阶

NTILE：把有序分区中的行分发到指定数据的组中，各个组有编号，编号从 1 开始，对于每一行， NTILE 返回此行所属的组的编号。hive > select name,orderdate,cost, sum over as sample f

成长之路 2020-06-25

数据仓库 ODS原始数据层操作

①ODS层存放的是原始数据，因此只需要一个字段就行。②ODS层的数据来源于HDFS，里面存储的文件带有压缩，因此需要指明相应的压缩方式。③在多人操作一张Hive表时，最好建立外部表，防止删表时将其中的数据也删掉了。2 编写将数据导入上述表中的脚本。④hiv

cyydjt 2020-06-25

安科网

Hive手写SQL案例

taisenki

1-请详细描述将一个有结构的文本文件student.txt导入到一个hive表中的步骤，及其关键字

2-利用HQL实现以下功能

2-1-创建表

2-2用 HQL 实现，求公司每年的员工费用总支出各是多少，并按年份降序排列?

2-3用 HQL 实现，求各部门每年的员工费用总支出各是多少，并按年份降序，按部门的支出升序排列？

2-4用 HQL 实现，求各部门历史所有员工费用总支出各是多少，按总支出多少排名降序，遇到值相等情况，不留空位。

2-5 用 HQL 实现，创建并生成员工薪资收入动态变化表，即员工 ID，员工姓名，员工本月薪资，本月薪资发放时间，员工上月薪资，上月薪资发放时间。分区字段为本月薪资发放时间。

2-6 用 HQL 实现，薪资涨幅方面，2018 年 5 月份谁的工资涨的最多，谁的涨幅最大？

Hive行列转换

Hive实现wordcount

Hive取TopN

求取指定状态下的订单id

微博体系中互粉的有多少组

购买了香蕉的人买了多少东西

taisenki

相关推荐

hive函数之~日期函数

hive函数之~字符串函数

Hive使用

3（Hive）

Hive函数大全-完整版

hdfs、hive、hbase的搭建总结

hive函数之~hive当中的lateral view 与 explode

hive函数之~窗口函数与分析函数

hive函数之~reflect函数

hive函数之~条件函数

hive函数之~关系运算

Hive的安装与启动

Hive llap服务安装说明及测试（二）

Hive学习之路（二）Hive安装

Hadoop

（一）hive远程模式搭建

Hive学习(二)【数据类型、类型转换】

Hive1.2.2（一）

hive开窗开窗函数进阶

数据仓库 ODS原始数据层操作

taisenki