Question

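I have a Pig script that loads a file using the "company" section of each JSON record. When I do a count, the count is 0 if domain is missing from the file (or is null). How can I group it as an empty string and still have it counted?

Example file: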

{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"name": "test4 company"}}
{"company": {"name": "test4 company"}}

Expected result:

"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
"", "test4 company", 2

Actual result:

"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
, "test4 company", 0

Current Pig script:

data = LOAD 'myfile' USING org.apache.pig.piggybank.storage.JsonLoader('company: (domain:chararray, name:chararray)');
filtered = FILTER data BY (company is not null);
events = FOREACH filtered GENERATE FLATTEN(company) as (domain, name);
grouped = GROUP events BY (domain, name);
counts = FOREACH grouped GENERATE group as domain, COUNT(events) as count;
ordered = ORDER counts by count DESC;

Thanks for your help!

Answer 1

Instead of COUNT, try COUNT_STAR:

counts = FOREACH grouped GENERATE group as domain, COUNT_STAR(events) as count;
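
Some background on why this helps: COUNT skips any tuple in the bag whose first field is null, and here the first field of each events tuple is domain, so the rows without a domain are grouped but counted as 0. COUNT_STAR counts every tuple regardless of nulls. COUNT_STAR alone still leaves the domain null in the output, so if you also want it to show up as an empty string, something like the following might work. It is only a sketch: it assumes the events relation from the question's script, the relation name normalized is made up for illustration, and the (chararray) cast is there only in case the field types did not carry through the FLATTEN.

```pig
-- Sketch only: continues from the `events` relation in the question's script.
-- Turn a null domain into '' before grouping so the group key is an empty string.
normalized = FOREACH events GENERATE
    (domain IS NULL ? '' : (chararray)domain) AS domain,
    name AS name;

grouped = GROUP normalized BY (domain, name);

-- COUNT_STAR counts every tuple in the bag, including tuples with null fields.
counts = FOREACH grouped GENERATE
    FLATTEN(group) AS (domain, name),
    COUNT_STAR(normalized) AS count;

ordered = ORDER counts BY count DESC;
```

Once domain has been replaced with '', plain COUNT would count those rows too, because the first field is no longer null. COUNT_STAR is kept here so the result does not depend on which field comes first.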
