Hive优化的五种方式
hive执行优化:
1. 使用tez
[code language=”SQL”]
set hive.execution.engine=tez;
[/code]
设置以上属性后,每次hive查询都会利用tez执行任务。
2. 使用ORCFile存储数据
Hive支持ORCFile,
[code language=”SQL”]
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties (“orc.compress" = “SNAPPY”);
INSERT INTO TABLE A_ORC SELECT * FROM A;
CREATE TABLE B_ORC (
customerID int, role string, salary float, department string
) STORED AS ORC tblproperties (“orc.compress" = “SNAPPY”);
INSERT INTO TABLE B_ORC SELECT * FROM B;
SELECT A_ORC.customerID, A_ORC.name,
A_ORC.age, A_ORC.address join
B_ORC.role, B_ORC.department, B_ORC.salary
ON A_ORC.customerID=B_ORC.customerID;
[/code]
3. 使用Vectorization
Vectorized查询执行会提高操作性能,例如scans、aggregations、filters和joins,通过一次执行1024行提高性能。
该属性在hive 0.13版本引入,该属性能够提高查询执行的时间。
通过设置以下两个参数:
[code language=”SQL”]
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
[/code]
4. 基于成本的查询优化
在提交最终的执行之前,hive会对每一次的逻辑和物理执行计划进行优化。这些优化到目前为止还不是基于成本的查询优化。
Cost-based optimization,在查询成本的基础上进行了优化,生成了一些不同的策略:how to order joins, which type of join to perorm, degree of parallelism and others.
[code language=”SQL”]
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
[/code]
5. Write goog SQL
例如:
[code language=”SQL”]
CREATE TABLE clicks (
timestamp date, sessionID string, url string, source_ip string
) STORED as ORC tblproperties (“orc.compress” = “SNAPPY”);
[/code]
[code language=”SQL”]
SELECT clicks.* FROM clicks inner join
(select sessionID, max(timestamp) as max_ts from clicks
group by sessionID) latest
ON clicks.sessionID = latest.sessionID and
clicks.timestamp = latest.max_ts;
[/code]
[code language=”SQL”]
SELECT * FROM
(SELECT *, RANK() over (partition by sessionID,
order by timestamp desc) as rank
FROM clicks) ranked_clicks
WHERE ranked_clicks.rank=1;
[/code]