PostgreSQL数据库随机取数据,数据抽样
- 对于数据量比较大的表,不要使用下面的这种方法,会比较慢
select * from review order by random() limit 10;
会全表扫描,很慢
- SYSTEM抽样方法
SELECT ... FROM table_name TABLESAMPLE sampling_method (argument [, ...]) [REPEATABLE (seed)]
sampling_method 抽样方法,主要有两种,system和bernoulli,argument是指抽样百分比,
举个栗子
select id from review tablesample system(0.0001); [email protected]:5432=#explain analyze select id from review tablesample system(0.0001); QUERY PLAN ---------------------------------------------------------------------------------------------------- Sample Scan on review (cost=0.00..4.06 rows=6 width=4) (actual time=3.328..3.899 rows=10 loops=1) Sampling: system ('0.0001'::real) Planning time: 0.110 ms Execution time: 3.920 ms (4 rows)
需要指出一条的是,system抽样方式是随机抽取表上的数据块的数据,理论上每个数据块被检索的概率是一样的,system抽样方式基于数据块级别,被选中的快上面的所以数据将被检索, 所以可能会出现每次随机抽取的数据条数是不一样的,甚至可能返回0条,可能是抽取到了一个没有数据的数据块,
关于数据块检索我们可以通过下面的例子看到,ctid是数据块的值,前面的那个是数据库编号,后面那个是该条数据在数据块上面的编号
[email protected]:5432=#select ctid, id from review tablesample system(0.0001); ctid | id ------+---- (0 rows) [email protected]:5432=#select ctid, id from review tablesample system(0.0001); ctid | id ------------+--------- (289034,1) | 4131829 (289034,2) | 4131830 (289034,3) | 4131853 (289034,4) | 4131854 (289034,5) | 4924212 (289034,6) | 5142532 (6 rows) [email protected]:5432=#select ctid, id from review tablesample system(0.0001); ctid | id ------+---- (0 rows) [email protected]:5432=#select ctid, id from review tablesample system(0.0001); ctid | id -------------+--------- (368648,1) | 5006567 (368648,2) | 5164392 (368648,3) | 4844708 (368648,4) | 4844524 (368648,5) | 4844526 (368648,6) | 4844621 (368648,7) | 4844748 (368648,8) | 4844614 (368648,9) | 4844678 (368648,10) | 4844530 (368648,11) | 4844532 (11 rows)
- BERNOULLI抽样方式
BERNOULLI抽样方式,是基于数据行的,所以相对来说,每次返回的数量相对比较一致,相差不多,这种抽取方式比SYSTEM抽样方式更加随机一些,但是相对性能要低很多,
[email protected]:5432=#explain analyze select ctid, id from review tablesample bernoulli(0.0001); QUERY PLAN -------------------------------------------------------------------------------------------------------------- Sample Scan on review (cost=0.00..429583.06 rows=6 width=10) (actual time=136.399..1524.570 rows=6 loops=1) Sampling: bernoulli ('0.0001'::real) Planning time: 0.113 ms Execution time: 1524.615 ms (4 rows)
然后由于BERNOULLI抽样方式是基于数据行的,所以一般而言,抽样得到的数据应该位于不同的数据块
[email protected]:5432=#select ctid, id from review tablesample bernoulli(0.0001); ctid | id -------------+--------- (82429,15) | 1419928 (131678,12) | 2159220 (273168,2) | 3965564 (344546,3) | 4675681 (4 rows)
- 总结
SYSTEM抽样方式适合抽样效率优先,针对大数据量啥的;
BERNOULLI适合随机性优先的场景
相关推荐
yangkang 2020-11-09
lbyd0 2020-11-17
sushuanglei 2020-11-12
85477104 2020-11-17
KANSYOUKYOU 2020-11-16
wushengyong 2020-10-28
lizhengjava 2020-11-13
星月情缘 2020-11-13
huangxiaoyun00 2020-11-13
luyong0 2020-11-08
腾讯soso团队 2020-11-06
Apsaravod 2020-11-05
PeterChangyb 2020-11-05
gaobudong 2020-11-04
wwwjun 2020-11-02
gyunwh 2020-11-02
EchoYY 2020-10-31
dingyahui 2020-10-30