I have a file of JSON entries that looks like this:
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin"}
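I want to count how often each distinct JSON object occurs in the file. I have seen other answers that use GROUP BY and the COUNT() function in Pig. I am not sure whether I am using them correctly, but I am not getting the result I want. My output should look like this: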
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua", "count": "3"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin", "count": "2"}
The order does not matter. Can someone give me some pointers?
Answer 0 (score: 0)
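The following code can be used. It groups on all of the fields, so identical records end up in the same group. If you want a different output format, you can read the fields from the group tuple and emit them in whatever format you need.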
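-- Load one JSON object per line, declaring every field as chararray
A = LOAD '/user/root/test12.json' USING JsonLoader('child_pos:chararray, parent_pos:chararray, parent:chararray, child_dep:chararray, parent_dep:chararray, child:chararray');
-- Group on all six fields so that identical records share a group
B = GROUP A BY (child_pos, parent_pos, parent, child_dep, parent_dep, child);
-- Emit the group key (a tuple) together with the number of records in the group
C = FOREACH B GENERATE group, COUNT(A.child_pos) AS COUNTX;
STORE C INTO 'user/data/json_out.json' USING JsonStorage();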
The output is:
{"group":{"child_pos":"NN","parent_pos":"NN","parent":"case","child_dep":"nn","parent_dep":"nsubj","child":"martin"},"COUNTX":2}
{"group":{"child_pos":"NN","parent_pos":"NN","parent":"fighter","child_dep":"nn","parent_dep":"nsubj","child":"virtua"},"COUNTX":3}
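The output above nests the six fields under "group", whereas the question asks for flat records. One way to read the fields back out of the group tuple is FLATTEN. The following is a minimal sketch rather than a tested script: it reuses the input path and schema from the script above, while the alias cnt and the output path 'user/data/json_out_flat.json' are hypothetical names chosen for illustration.

```pig
-- Sketch only: same input path and schema as the script above.
A = LOAD '/user/root/test12.json' USING JsonLoader('child_pos:chararray, parent_pos:chararray, parent:chararray, child_dep:chararray, parent_dep:chararray, child:chararray');
B = GROUP A BY (child_pos, parent_pos, parent, child_dep, parent_dep, child);
-- FLATTEN unnests the group tuple into top-level fields;
-- the AS clause gives them back their short names.
C = FOREACH B GENERATE
        FLATTEN(group) AS (child_pos, parent_pos, parent, child_dep, parent_dep, child),
        COUNT(A) AS cnt;  -- cnt is a hypothetical alias for the frequency
-- Hypothetical output path, kept separate from the original script's output.
STORE C INTO 'user/data/json_out_flat.json' USING JsonStorage();
```

With this variation each stored record should carry the six fields and the frequency at the top level. Note that the frequency is written as a number, not as a quoted string like the "3" shown in the question.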