UpdateStateByKey on JSON Data from Kafka in Spark

Time: 2016-12-08 14:31:45

Tags: json apache-spark apache-kafka spark-streaming

I am reading JSON data from Kafka into Spark. The data has start and end markers that indicate the beginning and end of a transaction. The JSON data:

{    "accountId": 1,   "name": "start"},
{    "accountId": 1,    "name": "A green door",     "prize":107},
{    "accountId": 2,    "name": "start"},
{    "accountId": 2,    "name": "A green door",   "prize":22 },
{    "accountId": 1,    "name": "end"},
{    "accountId": 2,    "name": "ABC",   "prize":221 },
{    "accountId": 2,    "name": "DV",   "prize":223 },
{    "accountId": 2,    "name": "end"}

I want to aggregate the prize for the corresponding accountId using updateStateByKey.

Can someone tell me how to do this?

Thanks.
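For reference, updateStateByKey keeps per-key state across micro-batches: the update function receives the list of values that arrived for a key in the current batch, plus the previous state (None the first time the key is seen), and returns the new state. A minimal sketch of that contract, with illustrative names:

def update_func(new_values, last_state):
    # last_state is None for a key's first batch
    return (last_state or 0) + sum(new_values)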

1 answer:

Answer 0 (score: 0):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="KafkaJsonState")
ssc = StreamingContext(sc, batchDuration=1)
# updateStateByKey requires a checkpoint directory
ssc.checkpoint("checkpoint")

data = [
    {"accountId": 1, "name": "start"},
    {"accountId": 1, "name": "A green door", "prize": 107},
    {"accountId": 2, "name": "start"},
    {"accountId": 2, "name": "A green door", "prize": 22},
    {"accountId": 1, "name": "end"},
    {"accountId": 2, "name": "ABC", "prize": 221},
    {"accountId": 2, "name": "DV", "prize": 223},
    {"accountId": 2, "name": "end"}
]

# Simulate the Kafka stream with a queue of one-record RDDs (a list
# comprehension, since queueStream expects a list of RDDs and map()
# returns an iterator in Python 3)
rddQueue = [sc.parallelize(["%s" % x]) for x in data]
qs = ssc.queueStream(rddQueue)
# Against a real Kafka topic, something like:
# qs = KafkaUtils.createDirectStream(ssc, ['test'], kafkaParams={}).map(lambda x: x[1])

def makeData(x):
    # The queued records are Python dict literals, so eval() parses them;
    # for real JSON from Kafka, prefer json.loads(x)
    kv = eval(x)
    k = kv.pop('accountId', 'NaN')
    return str(k), kv

# In the update function, x is the list of new record dicts for a key in
# this batch and y is the previous state (None on the first batch)
qs_all = qs.map(makeData).updateStateByKey(lambda x, y: (y or []) + x)

qs_all.pprint()

ssc.start()
ssc.awaitTermination()

Output (the accumulated state after the final batch):

('2', [{'name': 'start'}, {'prize': 22, 'name': 'A green door'}, {'prize': 221, 'name': 'ABC'}, {'prize': 223, 'name': 'DV'}, {'name': 'end'}])
('1', [{'name': 'start'}, {'prize': 107, 'name': 'A green door'}, {'name': 'end'}])
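
The state above accumulates the raw records. Since the question asks for the aggregated prize per accountId, a small variation keeps a running total instead. This is a minimal sketch, assuming the goal is to sum the prize fields; extractPrize is an illustrative helper name, and qs_sum would be wired up before ssc.start() in the code above:

def extractPrize(x):
    kv = eval(x)  # json.loads(x) for real JSON input
    return str(kv.get('accountId', 'NaN')), kv.get('prize', 0)

# State is the running sum of prizes per account; records without a
# prize field (the start/end markers) contribute 0
qs_sum = qs.map(extractPrize).updateStateByKey(
    lambda new_prizes, total: (total or 0) + sum(new_prizes))

qs_sum.pprint()

With the sample data above, the final state would be ('1', 107) and ('2', 466).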