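I am trying to read a dataset and process it. Each row has the type (String, String, String, Map[String, String]), and the map has 1 to 3 keys, so one row should become 1 to 3 rows of the form (String, String, String, k, v). I currently implement it with the following code: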
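import scala.collection.mutable.ArrayBuffer

// collect pulls every row of the dataset into driver memory
val arr = new ArrayBuffer[Array[String]]()
myDataset.collect.foreach { f: (String, String, String, Map[String, String]) =>
  val ma = f._4
  // emit one array per map entry
  for ((k, v) <- ma) {
    arr += Array(f._1, f._2, f._3, k, v)
  }
}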
The original data looks like this (one row of myDataset, which has hundreds of millions of rows):
val a = ("111","222","333",Map("k1"->"v1","k2"->"v2"))
Expected output:
("111","222","333","k1","v1")
("111","222","333","k2","v2")
But with large data this causes an OOM problem. Is there another way to do this, or how can I optimize my code to avoid the OOM?
Answer 0: (score: 1)
You just need to explode the map column and then select the expanded columns:
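// Assumes a spark-shell session, where sc and spark.implicits._ are in scope
import org.apache.spark.sql.functions.explode

val df = sc.parallelize(Array(
  ("111", "222", "333", Map("k1" -> "v1", "k2" -> "v2"))
)).toDF("a", "b", "c", "d")

// explode on a map column yields one row per entry, in columns "key" and "value"
df.select($"*", explode($"d"))
  .select("a", "b", "c", "key", "value")
  .as[(String, String, String, String, String)]
  .first
// (String, String, String, String, String) = (111,222,333,k1,v1)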
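If you would rather stay with typed tuples than go through column names, the same per-row expansion can probably be written as a flatMap. Below is a minimal sketch of that idea in plain Scala, so it runs without a cluster. The names ExpandRows, In, Out and expand are made up for illustration and are not from the question. On the real data you would pass the same function to the Dataset, e.g. myDataset.flatMap(expand _) with spark.implicits._ in scope, so the expansion happens on the executors and nothing is collected to the driver.

```scala
object ExpandRows {
  // Hypothetical type aliases matching the row shapes in the question
  type In  = (String, String, String, Map[String, String])
  type Out = (String, String, String, String, String)

  // Hypothetical helper: one output row per entry of the map in the 4th field
  def expand(row: In): Iterator[Out] = {
    val (a, b, c, m) = row
    m.iterator.map { case (k, v) => (a, b, c, k, v) }
  }

  def main(args: Array[String]): Unit = {
    // Local stand-in for the dataset: the single sample row from the question
    val rows = Seq(("111", "222", "333", Map("k1" -> "v1", "k2" -> "v2")))
    // flatMap flattens the per-row iterators into one sequence of rows
    rows.flatMap(expand).foreach(println)
  }
}
```

Either way, the point is to avoid collect and to finish with a distributed action (a write, a count, or further transformations) instead of building an ArrayBuffer on the driver.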