I am reading JSON data from Kafka in Spark. The data has start and end markers that show where a transaction begins and ends. JSON data:
`
{ "accountId": 1, "name": "start"},
{ "accountId": 1, "name": "A green door", "prize":107},
{ "accountId": 2, "name": "start"},
{ "accountId": 2, "name": "A green door", "prize":22 },
{ "accountId": 1, "name": "end"},
{ "accountId": 2, "name": "ABC", "prize":221 },
{ "accountId": 2, "name": "DV", "prize":223 },
{ "accountId": 2, "name": "end"}
`
I want to use updateStateByKey to sum the prize values for each accountId.
Can someone tell me how to do this?
Thanks.
Answer 0 (score: 0)
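import ast

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PrizeByAccount")  # app name is arbitrary
ssc = StreamingContext(sc, 1)  # 1-second batches
ssc.checkpoint("checkpoint")  # updateStateByKey requires a checkpoint directory

data = [
    {"accountId": 1, "name": "start"},
    {"accountId": 1, "name": "A green door", "prize": 107},
    {"accountId": 2, "name": "start"},
    {"accountId": 2, "name": "A green door", "prize": 22},
    {"accountId": 1, "name": "end"},
    {"accountId": 2, "name": "ABC", "prize": 221},
    {"accountId": 2, "name": "DV", "prize": 223},
    {"accountId": 2, "name": "end"}
]
# One RDD per record (as a string) to simulate a stream; a list, not a lazy map object
rddQueue = [sc.parallelize(["%s" % x]) for x in data]
qs = ssc.queueStream(rddQueue)
# qs = KafkaUtils.createDirectStream(ssc, ['test'], kafkaParams={}).map(lambda x: x[1])

def makeData(x):
    kv = ast.literal_eval(x)  # safer than eval; for real JSON from Kafka use json.loads
    k = kv.pop('accountId', 'NaN')
    return str(k), kv

# new_values: this batch's records for the key; state: list kept so far (None at first)
qs_all = qs.map(makeData).updateStateByKey(lambda new_values, state: (state or []) + new_values)
qs_all.pprint()
ssc.start()
ssc.awaitTermination()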
Output:
('2', [{'name': 'start'}, {'prize': 22, 'name': 'A green door'}, {'prize': 221, 'name': 'ABC'}, {'prize': 223, 'name': 'DV'}, {'name': 'end'}])
('1', [{'name': 'start'}, {'prize': 107, 'name': 'A green door'}, {'name': 'end'}])
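
The state shown above keeps every record per account; it does not yet sum the prizes. Here is a minimal sketch of an update function that keeps a running total instead. The name `update_total` and the `(total, is_open)` state shape are my own choices for illustration, not part of any Spark API, and the sketch assumes the records for an account arrive in order. It is plain Python, so it can be checked without a cluster:

```python
def update_total(new_values, state):
    """Fold one batch of records for a key into a (total, is_open) state."""
    total, is_open = state or (0, False)
    for record in new_values:
        name = record.get("name")
        if name == "start":
            total, is_open = 0, True   # a new transaction resets the sum
        elif name == "end":
            is_open = False            # keep the total, mark the transaction closed
        elif is_open:
            total += record.get("prize", 0)
    return total, is_open


# Simulate the batches delivered for accountId 2, one record per batch.
batches = [
    [{"name": "start"}],
    [{"name": "A green door", "prize": 22}],
    [{"name": "ABC", "prize": 221}],
    [{"name": "DV", "prize": 223}],
    [{"name": "end"}],
]
state = None
for batch in batches:
    state = update_total(batch, state)
print(state)  # (466, False)
```

If this fits your case, the function should be usable in place of the lambda, i.e. `qs.map(makeData).updateStateByKey(update_total)`, since it takes the same two arguments (the new values for the key and the previous state).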