hive NULL 以及在join连接时 \'和NULL连接查询结果

李双喆

2015-12-16

关注关注

hive NULL生成方式：

1 文本方式:

如果通过 load data local inpath '/opt/stu.txt' into table stu 这种上传本地文件数据方式到Hive制造hive NULL的时候，需要在 stu.txt中用

id name

1 zhangsan

2 \N

写法，而不是

id name

1 zhangsan

2 null ----> 这样入hive后的数据是字符串 null

2 数据库导入方式：

如果通过关系库sqoop导入到hdfs后通过hive外表关联方式：

类型	关系型数据库	hive
string	null	\N
string	""	""
int等其余类型	null	\N

对应hive string类型查询：
如果为 \N 使用is null查询 eg: where b.name is null;
如果为 “” 使用=''查询 eg: where b.name ='';

hive中 '' 和 NULL 在表关联下是否会产生多余数据:

先以mysql为例讲解:

insert into stu(id,name) values('1','zm1');
insert into stu(id) values('2');
insert into stu(id) values('3');

insert into stu1(id,name,age) values('4','zm1','31');
insert into stu1(id,age) values('5','32');
insert into stu1(id,age) values('6','33');

select a.*, b.* from stu a join stu1 b on a.name = b.name;

 
mysql> select * from stu;
+------+------+
| id   | name |
+------+------+
| 1    | zm1  |
| 2    | NULL |
| 3    | NULL |
+------+------+

mysql> select * from stu1;
+------+------+------+
| id   | name | age  |
+------+------+------+
| 4    | zm1  | 31   |
| 5    | NULL | 32   |
| 6    | NULL | 33   |
+------+------+------+

mysql> select a.*, b.* from stu a join stu1 b on a.name = b.name;
+------+------+------+------+------+
| id   | name | id   | name | age  |
+------+------+------+------+------+
| 1    | zm1  | 4    | zm1  | 31   |
+------+------+------+------+------+

 ！！！！！！！！！！！！！mysql会进行优化 将on中name为null的去掉！！！！！！！！！！！！
 



  
将上述stu stu1数据做修改, update stu set name='' where id = '2' 都修改后展示时mysql为如下样子：
mysql> select * from stu;
+------+------+
| id   | name |
+------+------+
| 1    | zm1  |
| 2    |      |
| 3    |      |
+------+------+
3 rows in set (0.00 sec)

mysql> select * from stu1;
+------+------+------+
| id   | name | age  |
+------+------+------+
| 4    | zm1  | 31   |
| 5    |      | 32   |
| 6    |      | 33   |
+------+------+------+
3 rows in set (0.00 sec)

mysql> select a.*, b.* from stu a join stu1 b on a.name = b.name;
+------+------+------+------+------+
| id   | name | id   | name | age  |
+------+------+------+------+------+
| 1    | zm1  | 4    | zm1  | 31   |
| 2    |      | 5    |      | 32   |
| 3    |      | 5    |      | 32   |
| 2    |      | 6    |      | 33   |
| 3    |      | 6    |      | 33   |
+------+------+------+------+------+
5 rows in set (0.00 sec)

！！！！！！！！！！！！！mysql不会对''做优化 关联查询时候会被关联到！！！！！！！！！！！！

讲mysql ‘’ 和 NULL 是因为经过试验操作 hive操作结果和 mysql操作结果一样,

即:

''下hive关联操作的字段会被作为关联条件,这样会产生很多垃圾数据,在ETL中数据做了预处理后,

建议查询条件增加非空串判断

eg: from stu a join tea b on a.name = b.name and a.name !='' and b.name != '' ;

NULL下hive关联操作的字段不会作为关联条件.

hive td