I have processed my data into the following format:
(id, {bag of words})
For example:
(foobar, {(foo), (foo),(foobar),(bar)})
(foo,{(bar),(bar)})
and so on. DESCRIBE processed gives me:
processed: {id: chararray,tokens: {tuple_of_tokens: (token: chararray)}}
What I want now is to count how many times each word appears in each id's bag and output it as:
foobar,foo,2
foobar,foobar,1
foobar,bar,1
foo,bar,2
and so on...
How can I do this in Pig?
Answer 0 (score: 1)
While you can do this in pure Pig, using a UDF for it should be more efficient. Something like:
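@outputSchema('wordcounts: {T:(word:chararray, count:int)}')
def generate_wordcount(BAG):
    # Pig hands a bag to Jython as a list of tuples; this assumes
    # each tuple holds exactly one field (the token), so unpack it.
    d = {}
    for (word,) in BAG:
        if word in d:
            d[word] += 1
        else:
            d[word] = 1
    # List of (word, count) pairs, returned to Pig as a bag of tuples.
    return d.items()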
You can then use this UDF like this:
REGISTER 'myudfs.py' USING jython AS myudfs ;
-- A: (id, words: {T:(word:chararray)})
B = FOREACH A GENERATE id, FLATTEN(myudfs.generate_wordcount(words)) ;
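The counting logic can be sanity-checked outside Pig with plain Python. This is only a sketch: it assumes the bag reaches the UDF as a list of one-element tuples, and the name count_words and the sample rows (copied from the question) are illustrative, not part of any Pig API.

```python
def count_words(bag):
    """Count occurrences of each word in a bag given as a list of 1-tuples."""
    counts = {}
    for (word,) in bag:
        counts[word] = counts.get(word, 0) + 1
    # Sorted only to make the printed order deterministic.
    return sorted(counts.items())

# Sample rows from the question: (id, bag of single-field tuples).
rows = [
    ("foobar", [("foo",), ("foo",), ("foobar",), ("bar",)]),
    ("foo", [("bar",), ("bar",)]),
]

# Emulates FOREACH A GENERATE id, FLATTEN(udf(words)):
# one output record per (id, word, count).
flattened = [
    (row_id, word, n)
    for row_id, bag in rows
    for word, n in count_words(bag)
]

for record in flattened:
    print(record)
```

Each id is paired with every (word, count) entry from its own bag, which is the shape the question asks for.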
Answer 1 (score: 1)
Try this:
$ cat input
foobar foo
foobar foo
foobar foobar
foobar bar
foo bar
foo bar
--preparing
inputs = LOAD 'input' AS (first: chararray, second: chararray);
grouped = GROUP inputs BY first;
formatted = FOREACH grouped GENERATE group, inputs.second AS second;
--what you need
flattened = FOREACH formatted GENERATE group, FLATTEN(second);
result = FOREACH (GROUP flattened BY (group, second)) GENERATE FLATTEN(group), COUNT(flattened);
DUMP result;
Output:
(foo,bar,2)
(foobar,bar,1)
(foobar,foo,2)
(foobar,foobar,1)
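The flatten, group and count steps amount to counting distinct (id, token) pairs. As a rough cross-check of the output above, here is the same computation in plain Python, assuming the six input rows shown in the answer; the variable names are illustrative only.

```python
from collections import Counter

# The (first, second) pairs from the sample input file.
pairs = [
    ("foobar", "foo"),
    ("foobar", "foo"),
    ("foobar", "foobar"),
    ("foobar", "bar"),
    ("foo", "bar"),
    ("foo", "bar"),
]

# Equivalent of GROUP ... BY (group, second) followed by COUNT.
pair_counts = Counter(pairs)

# Equivalent of FLATTEN(group), COUNT(...): one (id, token, count) record per pair.
result = sorted((first, second, n) for (first, second), n in pair_counts.items())

for record in result:
    print(record)
```

The records are sorted here only for readability; Pig does not guarantee the order of DUMP output without an explicit ORDER BY.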