impala在一个select中执行多个count distinct时会报错,比如执行
select key, count(distinct column_a), count(distinct column_b) from test_table group by key
会报错
Query submitted at: 2019-09-28 00:34:20 (Coordinator: http://DataOne-001:25000)ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT column_a); deviating function: count(DISTINCT column_b)Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.
这时有几种方法:
1.1 set APPX_COUNT_DISTINCT = true1.2 count distinct改为ndv,即ndv(column_a)这两种方法底层实现是一样的,设置APPX_COUNT_DISTINCT会自动将count distinct改写为ndv,ndv全称为(number of distinct values),用到 Cardinality(基数计数),底层实现是类似HLLC(Hyper LogLog Counting)这种概率算法,详见参考;
An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col), the "number of distinct values". It is much faster than the combination of COUNT and DISTINCT, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
改写为多个子查询然后join,比如
select a.key, a.count_a, b.count_b from (select key, count(distinct column_a) count_a from test_table group by key) a join(select key, count(distinct column_b) count_b from test_table group by key) b on a.key = b.key
参考:
ndv
http://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv
APPX_COUNT_DISTINCT
http://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html
其他
https://stackoverflow.com/questions/39236076/impala-all-distinct-aggregate-functions-need-to-have-the-same-set-of-parameters
转载于:https://www.cnblogs.com/barneywill/p/11601234.html