I have two datasets.
main_data.txt
{"id":"foo", "some_field":12354, "score":0}
{"id":"foobar", "some_field":12354, "score":0}
score_data.txt
{"id":"foo", "score":1}
{"id":"foobar","score":20}
...
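So the score in main_data is initialized to 0. Also, main_data and score_data have some IDs in common.
For the IDs they share, I want to replace the "score" in main_data with the score from score_data.
If an ID is absent from score_data, I want its score to stay 0.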
Answer 0 (score: 1)
Why initialize "score" to 0 at all? You can simply skip it and do a LEFT OUTER join of main_data with score_data. Whether you skip it or not, this should work:
main_data = LOAD USING SOME STORAGE; -- assume we have id as a column
score_data = LOAD USING SOME STORAGE; -- assume we have id, score as columns
-- join keys are named by their plain field name; the relation:: prefix only exists after the join
joined_data = JOIN main_data BY id LEFT OUTER, score_data BY id;
results = FOREACH joined_data GENERATE main_data::id, (score_data::score IS NULL ? 0 : score_data::score);
STORE results USING SOMETHING SOMEWHERE;
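
To sanity-check what the join should produce on the sample records, the same left-outer-join logic can be sketched in plain Python. This is only an illustration, not Pig: the function name merge_scores and the extra "baz" record are hypothetical, added so that one ID has no match in score_data. Unlike the Pig script above, this sketch keeps every field of main_data rather than just id and score.

```python
import json

# Sample records from the question; "baz" is a hypothetical extra id
# with no counterpart in score_data, added to exercise the default.
main_lines = [
    '{"id":"foo", "some_field":12354, "score":0}',
    '{"id":"foobar", "some_field":12354, "score":0}',
    '{"id":"baz", "some_field":12354, "score":0}',
]
score_lines = [
    '{"id":"foo", "score":1}',
    '{"id":"foobar", "score":20}',
]


def merge_scores(main_lines, score_lines):
    """Left outer join on id: take the score from score_data, else 0."""
    # Build an id -> score lookup from the score records.
    scores = {}
    for line in score_lines:
        rec = json.loads(line)
        scores[rec["id"]] = rec["score"]

    merged = []
    for line in main_lines:
        rec = json.loads(line)
        # Missing ids fall back to 0, mirroring the IS NULL ? 0 : score test.
        rec["score"] = scores.get(rec["id"], 0)
        merged.append(rec)
    return merged


results = merge_scores(main_lines, score_lines)
for rec in results:
    print(json.dumps(rec))
```

Every record of main_data survives the join, matched IDs pick up the score from score_data, and the unmatched one keeps 0.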