Hive's built-in services and a comparison of hiveserver and hiveserver2

leys

2016-09-23

I: Hive's built-in services

Running bin/hive --service help prints the following:

[master@master1 hive]$ bin/hive --service help
ls: cannot access /opt/spark/lib/spark-assembly-*.jar: No such file or directory
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: beeline cli help hiveburninclient hiveserver2 hiveserver hwi jar lineage metastore metatool orcfiledump rcfilecat schemaTool version 
Parameters parsed:
  --auxpath : Auxillary jars 
  --config : Hive configuration directory
  --service : Starts specific service/component. cli is default
Parameters used:
  HADOOP_HOME or HADOOP_PREFIX : Hadoop install directory
  HIVE_OPT : Hive options
For help on a particular service:
  ./hive --service serviceName --help
Debug help:  ./hive --debug --help

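The Service List line in the output above shows the services Hive supports: beeline cli help hiveburninclient hiveserver2 hiveserver hwi jar lineage metastore metatool orcfiledump rcfilecat schemaTool version. The most useful ones are described below.

1. cli: short for Command Line Interface. This is Hive's command-line shell. It is the most commonly used service and the default one, so it can be run directly from the command line.

2. hiveserver: runs Hive as a server that exposes a Thrift service, so that clients written in many different languages can talk to it. The HiveServer service must be started before clients can connect. The listening port can be set through the HIVE_PORT environment variable and defaults to 10000. HiveServer can be started like this: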

bin/hive --service hiveserver -p 10002

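The -p option also sets the listening port.

3. hwi: short for Hive Web Interface. It is Hive's web interface, a web-based alternative to the Hive CLI.

4. jar: the Hive counterpart of hadoop jar, a convenient way to run a Java application with both the Hadoop and Hive classes on the classpath.

5. metastore: by default the metastore runs in the same process as the Hive service. This service runs the metastore as a standalone process instead, and the listening port can be set with METASTORE_PORT.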

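II: Three ways to start Hive

1. Hive command-line mode

Go to the Hive installation directory and run bin/hive, or run hive --service cli.

This mode is for running queries from a Linux command line. The query syntax is largely similar to MySQL's.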

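2. Hive web interface

bin/hive --service hwi  (append & to run it in the background)

This lets you access Hive from a browser. I have not found it very useful. The address is 127.0.0.1:9999/hwi

3. Hive remote service (port 10000)

bin/hive --service hiveserver2 &  (the & runs it in the background)

Use this mode when programs written in Java, Python and so on access Hive through JDBC or similar drivers. It is the mode programmers need most.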

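III: HiveServer and HiveServer2

1: Overview

Both HiveServer and HiveServer2 let remote clients written in languages such as Java and Python submit requests to Hive and fetch the results, so a client can work with data in Hive without starting the CLI. HiveServer is no longer supported from Hive 0.15 onwards, but it is still worth covering here.

Both HiveServer and HiveServer2 are built on Thrift, but only HiveServer is sometimes called the Thrift server. Why is HiveServer2 needed when HiveServer already exists? Because HiveServer cannot handle concurrent requests from more than one client. This limitation comes from the Thrift interface that HiveServer uses and cannot be fixed by changing HiveServer's code. HiveServer was therefore rewritten as HiveServer2 in Hive 0.11.0, which solved the problem. HiveServer2 supports multi-client concurrency and authentication, and gives better support to open API clients such as JDBC and ODBC.

2: Differences between the two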

JDBC differences between HiveServer1 and HiveServer2:

HiveServer version   Connection URL                Driver class
HiveServer2          jdbc:hive2://<host>:<port>    org.apache.hive.jdbc.HiveDriver
HiveServer1          jdbc:hive://<host>:<port>     org.apache.hadoop.hive.jdbc.HiveDriver
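
The URL scheme and the driver class have to be switched together when moving a client from one server to the other. A minimal Python sketch of that pairing follows; the names JDBC_SETTINGS and jdbc_settings are hypothetical helpers invented for illustration, not part of Hive, and the default port of 10000 is the one quoted elsewhere in this post.

```python
# Hypothetical lookup table: URL scheme and JDBC driver class per server flavour.
JDBC_SETTINGS = {
    "hiveserver2": ("jdbc:hive2", "org.apache.hive.jdbc.HiveDriver"),
    "hiveserver": ("jdbc:hive", "org.apache.hadoop.hive.jdbc.HiveDriver"),
}


def jdbc_settings(server, host, port=10000):
    """Return (connection_url, driver_class) for the given server flavour."""
    scheme, driver = JDBC_SETTINGS[server.lower()]
    return "{}://{}:{}".format(scheme, host, port), driver


if __name__ == "__main__":
    url, driver = jdbc_settings("hiveserver2", "192.168.48.130")
    print(url)     # jdbc:hive2://192.168.48.130:10000
    print(driver)  # org.apache.hive.jdbc.HiveDriver
```

Passing an unknown server name raises KeyError, which is probably what you want here: guessing a driver class would only fail later with a less obvious error.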

3: A closer look at HiveServer and HiveServer2

HiveServer:

Run hive --service hiveserver --help on the command line to see the help for hiveserver:

[hadoop@hadoop~]$ hive --service hiveserver --help
Starting Hive Thrift Server
usage: hiveserver
 -h,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --maxWorkerThreads <arg>      maximum number of worker threads,
                                  default:2147483647
    --minWorkerThreads <arg>      minimum number of worker threads,
                                  default:100
 -p <port>                        Hive Server port number, default:10000
 -v,--verbose                     Verbose mode

Starting the hiveserver service shows that by default it listens on port 10000, with a minimum of 100 and a maximum of 2147483647 worker threads.

[hadoop@hadoop~]$ hive --service hiveserver -v
Starting Hive Thrift Server
14/08/01 11:07:09 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead
Starting hive server on port 10000 with 100 min worker threads and 2147483647 max worker threads

The hiveserver service above is not available in Hive 1.2.1. The official documentation says:

HiveServer is scheduled to be removed from Hive releases starting Hive 0.15. See HIVE-6977. Please switch over to HiveServer2.

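HiveServer2:

HiveServer2 is configured in hive-site.xml. The relevant parameters are:

hive.server2.thrift.min.worker.threads – minimum number of worker threads, default 5.
hive.server2.thrift.max.worker.threads – maximum number of worker threads, default 500.
hive.server2.thrift.port – TCP listening port, default 10000.
hive.server2.thrift.bind.host – host that the TCP interface binds to, default localhost.

The environment variables HIVE_SERVER2_THRIFT_BIND_HOST and HIVE_SERVER2_THRIFT_PORT can also be set to override the host and port in hive-site.xml. Since Hive 0.13.0, HiveServer2 can carry messages over HTTP, which is especially useful when a proxy sits between the client and the server. The HTTP transport parameters are: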

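hive.server2.transport.mode – defaults to binary (TCP); the other possible value is http.
hive.server2.thrift.http.port – HTTP listening port, default 10001.
hive.server2.thrift.http.path – name of the service endpoint, default cliservice.
hive.server2.thrift.http.min.worker.threads – minimum number of worker threads in the server pool, default 5.
hive.server2.thrift.http.max.worker.threads – maximum number of worker threads in the server pool, default 500.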
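
To make the precedence concrete, here is a small Python sketch that works out which address a client should use from these settings. It rests on a few assumptions that are not stated in this post: the function name resolve_jdbc_url is made up for illustration, the two environment variables are applied only in the way described above (host, and port in binary mode), and the ;transportMode=http;httpPath=... suffix is the form HTTP-mode JDBC URLs usually take in the Hive documentation.

```python
import os

# Defaults as listed in this post.
DEFAULTS = {
    "hive.server2.transport.mode": "binary",
    "hive.server2.thrift.bind.host": "localhost",
    "hive.server2.thrift.port": "10000",
    "hive.server2.thrift.http.port": "10001",
    "hive.server2.thrift.http.path": "cliservice",
}


def resolve_jdbc_url(site_conf, env=None, database="default"):
    """Combine defaults, hive-site.xml values and environment overrides into a JDBC URL."""
    env = os.environ if env is None else env
    conf = dict(DEFAULTS)
    conf.update(site_conf)  # hive-site.xml values beat the built-in defaults

    # Environment variables beat hive-site.xml for host and (binary) port.
    host = env.get("HIVE_SERVER2_THRIFT_BIND_HOST", conf["hive.server2.thrift.bind.host"])
    if conf["hive.server2.transport.mode"].lower() == "http":
        port = conf["hive.server2.thrift.http.port"]
        suffix = ";transportMode=http;httpPath=" + conf["hive.server2.thrift.http.path"]
    else:
        port = env.get("HIVE_SERVER2_THRIFT_PORT", conf["hive.server2.thrift.port"])
        suffix = ""
    return "jdbc:hive2://{}:{}/{}{}".format(host, port, database, suffix)


if __name__ == "__main__":
    print(resolve_jdbc_url({}, env={}))
    print(resolve_jdbc_url({"hive.server2.transport.mode": "http"}, env={}))
```

The env parameter exists only so the function can be tried without touching the real environment; leave it out to read os.environ.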

There are two ways to start HiveServer2: hive --service hiveserver2, already shown above, and the shorter hiveserver2. Run hive --service hiveserver2 -H or hive --service hiveserver2 --help to see the help:

Starting HiveServer2
Unrecognized option: -h
usage: hiveserver2
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property

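By default (hive.server2.enable.doAs set to true), HiveServer2 runs each query as the user who submitted it. If hive.server2.enable.doAs is set to false, queries run as the user who owns the hiveserver2 process. To prevent memory leaks in non-secure mode, the file system caches can be disabled by setting the following parameters to true:

fs.hdfs.impl.disable.cache – disables the HDFS file system cache, default false.
fs.file.impl.disable.cache – disables the local file system cache, default false.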

4: Configuring and using hiveserver2 (with Hive 2.0 as the example)

sudo vim hive-site.xml

1): Configure the listening port and host

<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>192.168.48.130</value>
</property>

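2): Set up impersonation

With this set to true, HiveServer2 executes statements as the submitting user. If it is set to false, statements run as the admin user that started the HiveServer2 daemon.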

<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>

3): HiveServer2 node configuration

HiveServer2 no longer needs the hive.metastore.local setting. If hive.metastore.uris is empty, the metastore is local; otherwise it is remote. For a remote metastore, just set hive.metastore.uris:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://xxx.xxx.xxx.xxx:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>

4): ZooKeeper configuration

<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>master1:2181,slave1:2181,slave2:2181</value>
</property>

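Note: if hive.zookeeper.quorum is not set, HiveQL requests cannot run concurrently and data can end up inconsistent.

5): HiveServer2 Web UI configuration

The Web UI is only available from Hive 2.0 onwards. Earlier versions do not have it.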

<property>
    <name>hive.server2.webui.host</name>
    <value>192.168.48.130</value>
    <description>The host address the HiveServer2 WebUI will listen on</description>
  </property>
  <property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
    <description>The port the HiveServer2 WebUI will listen on. This can be set to 0 or a negative integer to disable the web UI</description>
  </property>
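
Before starting the services it can help to read the settings back and confirm that nothing was mistyped. The following Python sketch does that with the standard library only. The function name load_hive_site and the inline SAMPLE document are made up for illustration; a real hive-site.xml wraps its <property> elements in a <configuration> root in the same way.

```python
import xml.etree.ElementTree as ET


def load_hive_site(xml_text):
    """Return {name: value} for every <property> in a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Illustrative sample only; in practice read the text of conf/hive-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    conf = load_hive_site(SAMPLE)
    for key in ("hive.server2.thrift.port", "hive.server2.webui.port", "hive.zookeeper.quorum"):
        print(key, "=", conf.get(key, "<not set>"))
```

A key reported as not set simply means Hive will fall back to its default for that property.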

Starting the services:

1): Start the metastore

bin/hive --service metastore &

The default port is 9083.

2): Start hiveserver2

bin/hive --service hiveserver2 &

3): Test

Web UI: http://192.168.48.130:10002/
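
Besides opening the Web UI in a browser, a quick TCP check shows whether each service is actually listening. This is only a sketch: the names is_listening, SERVICES and report are hypothetical, and the ports are the ones used in this post (9083 for the metastore, 10000 for hiveserver2, 10002 for the Web UI).

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports as configured in this post; adjust them if your hive-site.xml differs.
SERVICES = {"metastore": 9083, "hiveserver2": 10000, "web UI": 10002}


def report(host):
    """Return one status line per service for the given host."""
    return [
        "{:<12} port {:<6} {}".format(name, port, "up" if is_listening(host, port) else "down")
        for name, port in SERVICES.items()
    ]
```

For the setup above, print("\n".join(report("192.168.48.130"))) lists all three services. An open port only proves that something is listening there, so the beeline connection test is still worth doing.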


Controlling hiveserver2 from the beeline console

Start beeline: bin/beeline

Connect: !connect jdbc:hive2://192.168.48.130:10000 hive hive

This fails with: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: master is not allowed to impersonate hive (state=,code=0)

Fix: http://www.aboutyun.com/blog-331-2956.html

PS: I did not fix this at the time, because I hardly ever use beeline, so I set it aside to deal with later if needed.
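
For context, the remedy usually suggested for this kind of "not allowed to impersonate" error is to declare the user that runs HiveServer2 (master in the message above) as a Hadoop proxy user in core-site.xml, through the hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups properties. The Python sketch below only generates those two property elements. The function name proxyuser_properties is hypothetical, and the wildcard values are an assumption that suits a test cluster, not a recommendation for production.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(user, hosts="*", groups="*"):
    """Build the core-site.xml <property> elements that let `user` act as a proxy user."""
    elements = []
    for suffix, value in (("hosts", hosts), ("groups", groups)):
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = "hadoop.proxyuser.{}.{}".format(user, suffix)
        ET.SubElement(prop, "value").text = value
        elements.append(prop)
    return elements


if __name__ == "__main__":
    # "master" is the user named in the error message above.
    for prop in proxyuser_properties("master"):
        print(ET.tostring(prop, encoding="unicode"))
```

The printed elements belong inside the <configuration> element of Hadoop's core-site.xml, and Hadoop generally has to be restarted (or have its proxy-user settings refreshed) before the change takes effect.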

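====== Updated 2016.09.14 ======================================================

I recently needed to write a Hive client in Python, so I reread this post and had another go at the beeline problem.

HiveServer2 ships with a new command-line tool, Beeline, a JDBC client based on the SQLLine CLI. Beeline has two modes, embedded and remote. In embedded mode it runs an embedded Hive, much like the Hive CLI. In remote mode it connects to a separate hiveserver2 process over Thrift. Here is a Beeline session: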

[root@master1 hive]# bin/beeline 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/bigdata/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/bigdata/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://192.168.132.27:10000
Connecting to jdbc:hive2://192.168.132.27:10000
Enter username for jdbc:hive2://192.168.132.27:10000: hive        (enter the username here)
Enter password for jdbc:hive2://192.168.132.27:10000: ****        (enter the password here)
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.132.27:10000> show databases;              (list databases)
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+
1 row selected (0.274 seconds)
0: jdbc:hive2://192.168.132.27:10000> use default;                  (select a database)
OK 
No rows affected (0.069 seconds)
0: jdbc:hive2://192.168.132.27:10000> show tables;                  (list tables)
OK
+-----------+--+
| tab_name  |
+-----------+--+
+-----------+--+
No rows selected (0.093 seconds)
0: jdbc:hive2://192.168.132.27:10000> create table test(name string); (create a table)
OK
No rows affected (0.961 seconds)
0: jdbc:hive2://192.168.132.27:10000> show tables;                    (list tables)
OK
+-----------+--+
| tab_name  |
+-----------+--+
| test      |
+-----------+--+
1 row selected (0.129 seconds)
0: jdbc:hive2://192.168.132.27:10000> desc test;                       (describe the table)
OK
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| name      | string     |          |
+-----------+------------+----------+--+
1 row selected (0.258 seconds)

OK!!!
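
Since the goal was a Python client, one low-effort route is to drive Beeline non-interactively from Python rather than typing into the prompt. The sketch below assumes Beeline's -u (URL), -n (user name), -p (password) and -e (query) options. The function names beeline_command and run_query are made up for illustration, and the "hive" password is a placeholder for whatever was typed at the prompt above.

```python
import shlex
import subprocess


def beeline_command(url, user, password, query, beeline="bin/beeline"):
    """Build the argument list for a one-shot, non-interactive Beeline call."""
    return [beeline, "-u", url, "-n", user, "-p", password, "-e", query]


def run_query(url, user, password, query):
    """Run the query through Beeline and return (exit_code, stdout)."""
    result = subprocess.run(
        beeline_command(url, user, password, query),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_query() on a machine that has Beeline installed.
    cmd = beeline_command("jdbc:hive2://192.168.132.27:10000", "hive", "hive", "show tables;")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the arguments as a list avoids shell quoting problems with the query text. A dedicated Thrift or JDBC-style client library would be the more natural choice for anything beyond quick scripts.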