For example - in Spark Streaming, I have incoming data of the form -
{
  "id": xx,
  "a": 1,
  "b": 2,
  "c": 3,
  "d": 4,
  "scores": {
    "score1": "",
    "score2": "",
    "score3": ""
  }
}
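Each record is assumed to arrive as a JSON string and get parsed into a Python dict before any scoring runs - a minimal sketch of that step (the stream variable is an assumption, not shown above):

import json

def parse(record):
    # Each incoming record is assumed to be a JSON string like the sample above.
    return json.loads(record)

# rows = stream.map(parse)  # "stream" would be the input DStream (assumed)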
The pipeline used to process it is as follows -
def func1(row):
    # score1 = a + b
    row["scores"]["score1"] = row["a"] + row["b"]
    return row

def func2(row):
    # score2 = b + c
    row["scores"]["score2"] = row["b"] + row["c"]
    return row
def func3(row):
    # score3 = c + d
    row["scores"]["score3"] = row["c"] + row["d"]
    return row
def publish(iter):
    # publish to some cloud db
    pass

# For each RDD
def process(rdd):
    rdd1 = rdd.map(func1)
    rdd2 = rdd1.map(func2)
    rdd3 = rdd2.map(func3)
    rdd3.foreachPartition(publish)
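For reference, the pipeline above can be exercised locally like this (a minimal sketch: parallelize stands in for the real stream, and the sample row is made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "pipeline-sketch")

sample = [{"id": 1, "a": 1, "b": 2, "c": 3, "d": 4,
           "scores": {"score1": "", "score2": "", "score3": ""}}]

rdd = sc.parallelize(sample)
rdd3 = rdd.map(func1).map(func2).map(func3)
print(rdd3.collect())
# [{'id': 1, 'a': 1, 'b': 2, 'c': 3, 'd': 4,
#   'scores': {'score1': 3, 'score2': 5, 'score3': 7}}]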
Since all of my rdds are created sequentially, I understand this can be improved by modifying the process function to -

def process(rdd):
    rdd1 = rdd.map(func1)
    rdd2 = rdd.map(func2)
    rdd3 = rdd.map(func3)
    rdd4 = ...  # combine rdd1, rdd2 and rdd3
    rdd4.foreachPartition(publish)
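To illustrate what I mean by the combine step (only a sketch, not something I have verified - merge_rows is a made-up helper that keeps the non-empty score from each branch, assuming real scores are never 0 or empty):

def merge_rows(x, y):
    # Made-up helper: keep x's top-level fields and, per score key,
    # whichever of the two values is non-empty ("" is falsy in Python).
    merged = dict(x)
    merged["scores"] = {k: x["scores"][k] or y["scores"][k]
                        for k in x["scores"]}
    return merged

def process(rdd):
    rdd1 = rdd.map(func1)
    rdd2 = rdd.map(func2)
    rdd3 = rdd.map(func3)
    # One possible combine: key each branch by id, union, then reduce per id.
    rdd4 = (rdd1.union(rdd2).union(rdd3)
                .keyBy(lambda row: row["id"])
                .reduceByKey(merge_rows)
                .values())
    rdd4.foreachPartition(publish)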
I have two questions -

1. Is this modification actually an improvement?
2. How do I combine rdd1, rdd2 and rdd3 into rdd4?

Example - combining 3 rows with values like these -
{
  "id": xx,
  "a": 1,
  "b": 2,
  "c": 3,
  "d": 4,
  "scores": {
    "score1": "3",
    "score2": "",
    "score3": ""
  }
}
{
  "id": xx,
  "a": 1,
  "b": 2,
  "c": 3,
  "d": 4,
  "scores": {
    "score1": "",
    "score2": "5",
    "score3": ""
  }
}
{
  "id": xx,
  "a": 1,
  "b": 2,
  "c": 3,
  "d": 4,
  "scores": {
    "score1": "",
    "score2": "",
    "score3": "7"
  }
}
into an rdd of rows like this -
{
  "id": xx,
  "a": 1,
  "b": 2,
  "c": 3,
  "d": 4,
  "scores": {
    "score1": "3",
    "score2": "5",
    "score3": "7"
  }
}
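In plain Python terms (outside Spark), the per-row merge I am after boils down to this sketch (the a/b/c/d fields are omitted for brevity):

from functools import reduce

rows = [
    {"id": 1, "scores": {"score1": "3", "score2": "",  "score3": ""}},
    {"id": 1, "scores": {"score1": "",  "score2": "5", "score3": ""}},
    {"id": 1, "scores": {"score1": "",  "score2": "",  "score3": "7"}},
]

def merge_two(x, y):
    # For each score key, keep whichever value is non-empty.
    merged = dict(x)
    merged["scores"] = {k: x["scores"][k] or y["scores"][k] for k in x["scores"]}
    return merged

print(reduce(merge_two, rows)["scores"])
# {'score1': '3', 'score2': '5', 'score3': '7'}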
Thanks!