I have an RDD with the following structure:
RDD[((String), List[(Int,String)])]
It contains this data:
((John),List((4,00A), (5,00A), (15,00B), (15,00C)))
((Billing),List((7,00A)))
((Root),List((1,00A), (2,00B), (3,00C)))
((Marsh),List((2,00B), (3,00C)))
Now I want to filter it according to these rules:
1: If the list does not contain "00A", do not return the pair.
2: If the list does contain "00A", return all the "00A" items along with the "00C" items from the list. So the result should look like this:
((John),List((4,00A), (5,00A), (15,00C)))
((Billing),List((7,00A)))
((Root),List((1,00A), (3,00C)))
Edit: added the code posted in the comments.
I tried this:
val rdd = df
  .rdd
  .map{ case Row(id: Int, name: String, code: String) => ((name), List((id, code))) }
  .reduceByKey(_ ++ _)

val r = rdd
  .map{ case (t, list) =>
    val tempList = list.map{ case (id, code) => (id, code) }
    val newList = tempList.map{
      case (id, "00A") => (id, "00A")
      case (id, "00C") => (id, "00C")
      case (id, code)  => List.empty
    }
    (t, newList)
  }
Answer 0 (score: 1)
Straightforward:
This isn't really an RDD/Spark question, so I've done it with Lists. If you're working with Lists, you can do the filter/map in a single collect, but collect means something completely different on an RDD.
Assuming your data really looks like this:
val xs = List(("John",List((4,"00A"), (5,"00A"), (15,"00B"), (15,"00C"))),
("Billing",List((7,"00A"))),
("Root",List((1,"00A"), (2,"00B"), (3,"00C"))),
("Marsh", List((2,"00B"), (3,"00C"))))
Then, first we filter to keep only the entries that have "00A" somewhere:
val filtered = xs.filter{ case (key, ys) => ys.exists(y => y._2 == "00A") }
Then map the result to keep only the "00A" and "00C" items:
val result = filtered.map{case (key, ys) =>
(key, ys.filter(y=>y._2 == "00A" || y._2 == "00C"))}
//> result : List[(String, List[(Int, String)])] = List(
//    (John,List((4,00A), (5,00A), (15,00C))),
//    (Billing,List((7,00A))),
//    (Root,List((1,00A), (3,00C))))
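As mentioned above, on plain Lists the filter and map steps can be fused into a single collect. This is a sketch of that variant; note that collect on a Scala collection takes a partial function and is unrelated to Spark's RDD.collect:

```scala
val xs = List(
  ("John",    List((4,"00A"), (5,"00A"), (15,"00B"), (15,"00C"))),
  ("Billing", List((7,"00A"))),
  ("Root",    List((1,"00A"), (2,"00B"), (3,"00C"))),
  ("Marsh",   List((2,"00B"), (3,"00C"))))

// collect combines the filter (via the pattern guard) and the map in one pass:
// pairs whose list has no "00A" simply don't match and are dropped.
val result = xs.collect {
  case (key, ys) if ys.exists(_._2 == "00A") =>
    (key, ys.filter(y => y._2 == "00A" || y._2 == "00C"))
}
// result: List((John,List((4,00A), (5,00A), (15,00C))),
//              (Billing,List((7,00A))),
//              (Root,List((1,00A), (3,00C))))
```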