Hive优化的五种方式

hive执行优化:

1. 使用tez

[code language=”SQL”]

set hive.execution.engine=tez;

[/code]

设置以上属性后,每次hive查询都会利用tez执行任务。

2. 使用ORCFile存储数据

Hive支持ORCFile,

[code language=”SQL”]

CREATE TABLE A_ORC (

customerID int, name string, age int, address string

) STORED AS ORC tblproperties (“orc.compress" = “SNAPPY”);

INSERT INTO TABLE A_ORC SELECT * FROM A;

CREATE TABLE B_ORC (

customerID int, role string, salary float, department string

) STORED AS ORC tblproperties (“orc.compress" = “SNAPPY”);

INSERT INTO TABLE B_ORC SELECT * FROM B;

SELECT A_ORC.customerID, A_ORC.name,

A_ORC.age, A_ORC.address join

B_ORC.role, B_ORC.department, B_ORC.salary

ON A_ORC.customerID=B_ORC.customerID;

[/code]

3. 使用Vectorization

Vectorized查询执行会提高操作性能,例如scans、aggregations、filters和joins,通过一次执行1024行提高性能。

该属性在hive 0.13版本引入,该属性能够提高查询执行的时间。

通过设置以下两个参数:

[code language=”SQL”]

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;

[/code]

4. 基于成本的查询优化

在提交最终的执行之前,hive会对每一次的逻辑和物理执行计划进行优化。这些优化到目前为止还不是基于成本的查询优化。

Cost-based optimization,在查询成本的基础上进行了优化,生成了一些不同的策略:how to order joins, which type of join to perorm, degree of parallelism and others.

[code language=”SQL”]

set hive.cbo.enable=true;

set hive.compute.query.using.stats=true;

set hive.stats.fetch.column.stats=true;

set hive.stats.fetch.partition.stats=true;

[/code]

5. Write goog SQL

例如:

[code language=”SQL”]

CREATE TABLE clicks (

timestamp date, sessionID string, url string, source_ip string

) STORED as ORC tblproperties (“orc.compress” = “SNAPPY”);

[/code]

[code language=”SQL”]

SELECT clicks.* FROM clicks inner join

(select sessionID, max(timestamp) as max_ts from clicks

group by sessionID) latest

ON clicks.sessionID = latest.sessionID and

clicks.timestamp = latest.max_ts;

[/code]

[code language=”SQL”]

SELECT * FROM

(SELECT *, RANK() over (partition by sessionID,

order by timestamp desc) as rank

FROM clicks) ranked_clicks

WHERE ranked_clicks.rank=1;

[/code]

相关推荐