Learning Apache Drill with the MapR Sandbox, Lesson 2: Run Queries with ANSI SQL (translation)
Goal

This lesson shows how to do some standard SQL analysis in Apache Drill: for example, summarizing data with simple aggregate functions and joining data sources. Note that Apache Drill provides ANSI SQL support, not a "SQL-like" interface.

Queries in This Lesson

Now that you know what the data sources look like in their raw form, from the select * queries, try running some simple but more meaningful queries on each data source. These queries demonstrate how Drill supports ANSI SQL constructs and how Drill can combine data from different data sources in a single SELECT statement.
Show an aggregate query on a single file or table. Use the GROUP BY, WHERE, HAVING, and ORDER BY clauses.
Perform joins between Hive, MapR-DB, and file system data sources.
Use table and column aliases.
Create a Drill view.
Aggregation

Set the schema to hive:
0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
|  ok   |                  summary                  |
+-------+-------------------------------------------+
| true  | Default schema changed to [hive.default]  |
+-------+-------------------------------------------+
1 row selected
Return sales totals by month:
0: jdbc:drill:> select `month`, sum(order_total) from orders group by `month` order by 2 desc;
+------------+---------+
|   month    | EXPR$1  |
+------------+---------+
| June       | 950481  |
| May        | 947796  |
| March      | 836809  |
| April      | 807291  |
| July       | 757395  |
| October    | 676236  |
| August     | 572269  |
| February   | 532901  |
| September  | 373100  |
| January    | 346536  |
+------------+---------+
10 rows selected
Drill supports SQL aggregate functions such as SUM, MAX, AVG, and MIN. Standard SQL clauses work in Drill queries the same way they do in relational database queries.

Note the use of back ticks around `month` in the query; they are required only because "month" is a reserved word in SQL.
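As a quick sketch of the other aggregate functions against the same orders table (the column aliases here are illustrative, not from the tutorial):

```sql
-- MIN, MAX, and AVG follow the same pattern as SUM;
-- `month` is back-ticked because MONTH is a reserved word.
select `month`,
       min(order_total) as smallest,
       max(order_total) as largest,
       avg(order_total) as average
from orders
group by `month`;
```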
Return the top 20 sales totals by month and state:
0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state order by 3 desc limit 20;
+-----------+--------+---------+
|   month   | state  |  sales  |
+-----------+--------+---------+
| May       | ca     | 119586  |
| June      | ca     | 116322  |
| April     | ca     | 101363  |
| March     | ca     | 99540   |
| July      | ca     | 90285   |
| October   | ca     | 80090   |
| June      | tx     | 78363   |
| May       | tx     | 77247   |
| March     | tx     | 73815   |
| August    | ca     | 71255   |
| April     | tx     | 68385   |
| July      | tx     | 63858   |
| February  | ca     | 63527   |
| June      | fl     | 62199   |
| June      | ny     | 62052   |
| May       | fl     | 61651   |
| May       | ny     | 59369   |
| October   | tx     | 55076   |
| March     | fl     | 54867   |
| March     | ny     | 52101   |
+-----------+--------+---------+
20 rows selected
Note the alias for the result of the SUM function. Drill supports both column aliases and table aliases.

The HAVING Clause

This query uses the HAVING clause to constrain an aggregate result.

Set the workspace to dfs.clicks:
0: jdbc:drill:> use dfs.clicks;
+-------+-----------------------------------------+
|  ok   |                 summary                 |
+-------+-----------------------------------------+
| true  | Default schema changed to [dfs.clicks]  |
+-------+-----------------------------------------+
1 row selected
Return the total number of clicks for devices that registered high click activity:
0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t group by t.user_info.device having count(*) > 1000;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| IOS5    | 11814   |
| AOS4.2  | 5986    |
| IOS6    | 4464    |
| IOS7    | 3135    |
| AOS4.4  | 1562    |
| AOS4.3  | 3039    |
+---------+---------+
6 rows selected
The aggregate is a count of the records for each distinct mobile device in the clickstream data. Only devices that registered more than 1000 transactions qualify for the result set.

The UNION Operator

Use the same workspace as before (dfs.clicks).

Combine clickstream activity from before and after the marketing campaign:
0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t union all select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u limit 5;
+--------------+-----------+
| transaction  | customer  |
+--------------+-----------+
| 35232        | 18520     |
| 31995        | 17182     |
| 35760        | 18228     |
| 37090        | 17015     |
| 37838        | 18737     |
+--------------+-----------+
This UNION ALL query returns rows that exist in the two files (including any duplicate rows from those files): clicks.campaign.json and clicks.json.

Subqueries

Set the workspace to hive:
0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
|  ok   |                  summary                  |
+-------+-------------------------------------------+
| true  | Default schema changed to [hive.default]  |
+-------+-------------------------------------------+
1 row selected
Compare order totals across states:
0: jdbc:drill:> select ny_sales.cust_id, ny_sales.total_orders, ca_sales.total_orders
from
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ny' group by o.cust_id) ny_sales
left outer join
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ca' group by o.cust_id) ca_sales
on ny_sales.cust_id = ca_sales.cust_id
order by ny_sales.cust_id limit 20;
+------------+------------+------------+
|  cust_id   |  ny_sales  |  ca_sales  |
+------------+------------+------------+
| 1001       | 72         | 47         |
| 1002       | 108        | 198        |
| 1003       | 83         | null       |
| 1004       | 86         | 210        |
| 1005       | 168        | 153        |
| 1006       | 29         | 326        |
| 1008       | 105        | 168        |
| 1009       | 443        | 127        |
| 1010       | 75         | 18         |
| 1012       | 110        | null       |
| 1013       | 19         | null       |
| 1014       | 106        | 162        |
| 1015       | 220        | 153        |
| 1016       | 85         | 159        |
| 1017       | 82         | 56         |
| 1019       | 37         | 196        |
| 1020       | 193        | 165        |
| 1022       | 124        | null       |
| 1023       | 166        | 149        |
| 1024       | 233        | null       |
+------------+------------+------------+
This example demonstrates Drill's support for subqueries.

The CAST Function

Use the maprdb workspace:
0: jdbc:drill:> use maprdb;
+-------+-------------------------------------+
|  ok   |               summary               |
+-------+-------------------------------------+
| true  | Default schema changed to [maprdb]  |
+-------+-------------------------------------+
1 row selected (0.088 seconds)
Return customer data with the appropriate data types applied:
0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name,
cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from customers t limit 5;
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| cust_id  |         name         |  gender   |    age    | state  | agg_rev  | membership  |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| 10001    | "Corrine Mecham"     | "FEMALE"  | "15-20"   | "va"   | 197.00   | "silver"    |
| 10005    | "Brittany Park"      | "MALE"    | "26-35"   | "in"   | 230.00   | "silver"    |
| 10006    | "Rose Lokey"         | "MALE"    | "26-35"   | "ca"   | 250.00   | "silver"    |
| 10007    | "James Fowler"       | "FEMALE"  | "51-100"  | "me"   | 263.00   | "silver"    |
| 10010    | "Guillermo Koehler"  | "OTHER"   | "51-100"  | "mn"   | 202.00   | "silver"    |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
Note the following features of this query:

The CAST function is required for every column in the table. It returns the binary data in MapR-DB/HBase as readable integers and strings. Alternatively, you can use the CONVERT_TO/CONVERT_FROM functions to decode the columns. In most cases CONVERT_TO/CONVERT_FROM are more efficient than CAST. Use CONVERT_TO only to convert binary types to any type other than VARCHAR.

The row_key column functions as the primary key of the table (in this case, the customer ID).

The table alias t is required; otherwise, the column family names would be parsed as table names and the query would return an error.
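As a sketch of the CONVERT_FROM alternative mentioned above ('INT_BE' and 'UTF8' are standard Drill encoding names, but the encodings actually used for your table's data may differ):

```sql
-- Decode the binary row key as a big-endian int and a
-- string column as UTF-8 text, instead of using CAST.
select convert_from(row_key, 'INT_BE') as cust_id,
       convert_from(t.personal.name, 'UTF8') as name
from customers t limit 5;
```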
Remove the quotes from the strings:

You can use the regexp_replace function to remove the quotes around the strings in the query results. For example, to return the state name va rather than "va":
0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','') from customers t limit 1;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 10001      | va         |
+------------+------------+
1 row selected
The CREATE VIEW Command
0: jdbc:drill:> use dfs.views;
+-------+----------------------------------------+
|  ok   |                summary                 |
+-------+----------------------------------------+
| true  | Default schema changed to [dfs.views]  |
+-------+----------------------------------------+
1 row selected
Use a mutable workspace:

A mutable (or writable) workspace is a workspace that supports "write" operations. This attribute is part of the storage plugin configuration. You can create Drill views and tables in mutable workspaces.
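For reference, a workspace is marked writable in the storage plugin configuration. A sketch of the relevant fragment of the dfs plugin JSON, assuming the sandbox's views path (your location may differ):

```json
"workspaces": {
  "views": {
    "location": "/mapr/demo.mapr.com/data/views",
    "writable": true,
    "defaultInputFormat": null
  }
}
```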
Create a view on a MapR-DB table:
0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
cast(t.personal.name as varchar(20)) as name,
cast(t.personal.gender as varchar(10)) as gender,
cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state,
cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from maprdb.customers t;
+-------+-------------------------------------------------------------+
|  ok   |                           summary                           |
+-------+-------------------------------------------------------------+
| true  | View 'custview' created successfully in 'dfs.views' schema  |
+-------+-------------------------------------------------------------+
1 row selected
Drill provides CREATE OR REPLACE VIEW syntax, similar to relational databases, for creating views. Use the OR REPLACE option to make it easier to update the view later without having to first drop it. Note that the FROM clause in this example must refer to maprdb.customers; MapR-DB tables are not directly visible from the dfs.views workspace.

Unlike traditional databases, where creating views is typically a DBA/developer-driven operation, file-system-based views in Drill are very lightweight. A view is simply a special file with a specific (.drill) extension. You can store views in your local file system or point to a specific workspace, and you can specify any query against any Drill data source in the body of the CREATE VIEW statement.

Drill provides a decentralized metadata model: Drill can query metadata defined in data sources such as Hive, HBase, and the file system, and it also supports creating metadata in the file system.
Query data through the view:
0: jdbc:drill:> select * from custview limit 1;
+----------+-------------------+-----------+----------+--------+----------+-------------+
| cust_id  |       name        |  gender   |   age    | state  | agg_rev  | membership  |
+----------+-------------------+-----------+----------+--------+----------+-------------+
| 10001    | "Corrine Mecham"  | "FEMALE"  | "15-20"  | "va"   | 197.00   | "silver"    |
+----------+-------------------+-----------+----------+--------+----------+-------------+
1 row selected
Once users know what data is available by exploring it directly from the file system, views can be used to take that data downstream to tools such as Tableau and MicroStrategy for analysis and visualization. To these tools, a view simply appears as a "table" with selectable "columns".

Query Across Data Sources

Continue using the dfs.views workspace.

Join the customer view and the orders table:
0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview where orders.cust_id=custview.cust_id group by membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 380665     |
| "silver"   | 708438     |
| "gold"     | 2787682    |
+------------+------------+
3 rows selected
In this query, we are reading data from the MapR-DB table (represented by custview) and combining it with the orders table information in Hive. When doing cross data source queries such as this, you need to use fully qualified table/view names. For example, the orders table is prefixed by "hive", which is the storage plugin name registered with Drill. No prefix is needed for "custview" because we explicitly switched to the dfs.views workspace where custview is stored.

Note: If the results of any of your queries appear to be truncated because the rows are wide, set the maximum display width to 10000.

This command does not require a trailing semicolon.
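In the sqlline shell this is done with the !set command (a sketch; note the absence of a semicolon):

```
0: jdbc:drill:> !set maxwidth 10000
```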
Join the customer, orders, and clickstream data:
0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c
where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id
group by custview.membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 372866     |
| "silver"   | 728424     |
| "gold"     | 7050198    |
+------------+------------+
3 rows selected
This three-way join selects from three different data sources in one query:
the hive.orders table
custview (a view of the HBase customers table)
the clicks.json file
The join column for both sets of join conditions is the cust_id column. The views workspace is used for this query so that custview is accessible; the hive.orders table is also visible to the query.
However, note that the JSON file is not directly visible from the views workspace, so the query specifies the full path to the file:
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json`
Lesson 3: Run Queries on Complex Data Types →