Java UDFs return a scalar result. Java UDTFs are not currently supported.
reference
That said, I created a Java UDF as shown below:
CREATE OR REPLACE FUNCTION MAP_COUNT(colValue String)
returns OBJECT
language java
handler='Frequency.calculate'
target_path='@~/Frequency.jar'
as
$$
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    class Frequency {
        Map<String, Integer> frequencies = new HashMap<>();

        public Map<String, Integer> calculate(String colValue) {
            frequencies.putIfAbsent(colValue, 0);
            frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
            return frequencies;
        }
    }
$$;
I use the MAP_COUNT UDF in a query as shown below:
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select MAP_COUNT(a.my_col) from temp_1 a;
The result I get is:
| MAP_COUNT(A.MY_COL)         |
|-----------------------------|
| { "John": "1" }             |
| { "John": "2" }             |
| { "John": "2", "doe": "1" } |
| { "John": "2", "doe": "2" } |
The result I expect from my UDF is:
| MAP_COUNT(A.MY_COL)         |
|-----------------------------|
| { "John": "2", "doe": "2" } |
Is this possible in Snowflake?
And what if I have a query like the following?
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select MAP_COUNT(a.my_col) as names, MAP_COUNT(a.age) as ages from temp_1 a;
The result I expect from my UDF is:
| NAMES                       | AGES                      |
|-----------------------------|---------------------------|
| { "John": "2", "doe": "2" } | { "27": "2", "28": "2" }  |
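There are ways to achieve this by simply restructuring the query, but I would like to know whether it can be done with a MAP_COUNT function used in the select clause, the same way the OBJECT_AGG function is used.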
Answer 0 (score: 2)
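When you run a query that uses a UDF, the rows do not necessarily all go to the same instance of the UDF. For example, suppose you are selecting from a table and you do this: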
SELECT MyUdf(x) FROM T
Here T may have multiple micro-partitions, and the way the query is executed is actually similar to:
SELECT MyUdf(x) FROM T_part1 UNION ALL
SELECT MyUdf(x) FROM T_part2 UNION ALL
SELECT MyUdf(x) FROM T_part3 UNION ALL
SELECT MyUdf(x) FROM T_part4
Here there are four separate instances of MyUdf, and each instance sees only a subset of all the rows in T.
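To make the effect concrete, here is a minimal local simulation in plain Java. It is only a sketch: the class and method names (PartitionedUdfDemo, StatefulFrequency, runOnPartition) are hypothetical, and splitting rows into lists by hand stands in for whatever Snowflake actually does with micro-partitions. Each "partition" gets a fresh instance, so a counter held in instance state only ever reflects the rows that instance happened to see.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local simulation; this is not Snowflake's execution engine.
class PartitionedUdfDemo {

    // Equivalent counting logic to the question's handler: the map lives on the instance.
    static class StatefulFrequency {
        private final Map<String, Integer> frequencies = new HashMap<>();

        Map<String, Integer> calculate(String colValue) {
            frequencies.merge(colValue, 1, Integer::sum);
            // Return a snapshot so each row's result is independent of later rows.
            return new HashMap<>(frequencies);
        }
    }

    // Runs one fresh instance over one slice of rows and returns the result for the last row.
    static Map<String, Integer> runOnPartition(List<String> rows) {
        StatefulFrequency udf = new StatefulFrequency();
        Map<String, Integer> last = new HashMap<>();
        for (String row : rows) {
            last = udf.calculate(row);
        }
        return last;
    }

    public static void main(String[] args) {
        // Two instances, each seeing half of the rows: every count stays at 1.
        System.out.println(runOnPartition(List.of("John", "doe")));
        System.out.println(runOnPartition(List.of("John", "doe")));
        // One instance seeing all rows: counts reach 2, but only on the last row.
        System.out.println(runOnPartition(List.of("John", "John", "doe", "doe")));
    }
}
```

In the split case no instance ever reports a count of 2, which is why per-instance state cannot be relied on to produce a total.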
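Back to your example: you are trying to emulate a user-defined aggregate function (UDAF), where one particular instance of the UDF sees every row. The way to guarantee that is to aggregate beforehand, for example: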
CREATE OR REPLACE FUNCTION MAP_COUNT(colValue array)
returns OBJECT
language java
handler='Frequency.calculate'
target_path='@~/Frequency.jar'
as
$$
    import java.util.HashMap;
    import java.util.Map;

    class Frequency {
        // Receives the whole pre-aggregated array in a single call,
        // so no state has to survive between rows.
        public Map<String, Integer> calculate(String[] colValues) {
            Map<String, Integer> frequencies = new HashMap<>();
            for (String colValue : colValues) {
                frequencies.putIfAbsent(colValue, 0);
                frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
            }
            return frequencies;
        }
    }
$$;
(Note that I changed the UDF and the method signature to use array and String[], respectively.) Now use it in the query:
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select
MAP_COUNT(ARRAY_AGG(a.my_col)) as names,
MAP_COUNT(ARRAY_AGG(a.age)) as ages
from temp_1 a;
This gives me:
| NAMES                       | AGES                      |
|-----------------------------|---------------------------|
| { "John": "2", "doe": "2" } | { "27": "2", "28": "2" }  |
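The array-based handler can also be checked locally before it is uploaded. The following is a minimal sketch that copies the counting logic into a hypothetical class, FrequencyLocalCheck, with a static method for convenience; it assumes the age values arrive as strings, as the String[] signature suggests.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical local harness; the class name is illustrative only.
class FrequencyLocalCheck {

    // Same counting logic as the array-based handler, made static for local use.
    static Map<String, Integer> calculate(String[] colValues) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String colValue : colValues) {
            frequencies.putIfAbsent(colValue, 0);
            frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // Stand-ins for ARRAY_AGG(my_col) and ARRAY_AGG(age) over the four sample rows.
        String[] names = {"John", "John", "doe", "doe"};
        String[] ages = {"27", "28", "27", "28"}; // assumption: numbers are passed as strings

        System.out.println(calculate(names));
        System.out.println(calculate(ages));
    }
}
```

Because the map is created inside the method, every call starts from an empty state, and the result depends only on the array that was passed in.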
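There are still two problems here, in particular:
... in ARRAY_AGG.
The good news is that both problems will be resolved once Java UDAFs become available at some point in the future.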
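As a rough illustration of why an aggregate function fits this problem, here is a sketch of the usual accumulate / merge / finish shape in plain Java. Everything in it is hypothetical: the class FrequencyAggregate and its method names are made up for this example and are not a Snowflake API. The point is only that partial counts built independently on separate slices of rows can be merged into one total.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical aggregate shape; the names are illustrative and not a Snowflake API.
class FrequencyAggregate {
    private final Map<String, Integer> counts = new HashMap<>();

    // Called once per row within a single slice of the input.
    void accumulate(String value) {
        counts.merge(value, 1, Integer::sum);
    }

    // Folds another partial state into this one by adding the counts key by key.
    void merge(FrequencyAggregate other) {
        other.counts.forEach((key, value) -> counts.merge(key, value, Integer::sum));
    }

    // Produces the final result as a copy of the combined state.
    Map<String, Integer> finish() {
        return new HashMap<>(counts);
    }

    // Builds one partial state per slice, then merges all of them.
    static Map<String, Integer> combine(List<List<String>> partitions) {
        FrequencyAggregate total = new FrequencyAggregate();
        for (List<String> partition : partitions) {
            FrequencyAggregate partial = new FrequencyAggregate();
            for (String value : partition) {
                partial.accumulate(value);
            }
            total.merge(partial);
        }
        return total.finish();
    }

    public static void main(String[] args) {
        // Two slices of the sample rows, counted separately and then merged.
        System.out.println(combine(List.of(
                List.of("John", "doe"),
                List.of("John", "doe"))));
    }
}
```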