How to split one row into several rows based on the number of map keys

Time: 2017-12-07 08:40:40

Tags: scala apache-spark dataset

I am trying to read a dataset and process it. Each row has type (String, String, String, Map[String, String]), and the map has 1 to 3 keys, so each row should expand into 1-3 rows of the form (String, String, String, k, v). I currently implement it with code like this:

import scala.collection.mutable.ArrayBuffer

var arr = new ArrayBuffer[Array[String]]()
myDataset.collect.foreach {
  f: (String, String, String, Map[String, String]) =>
    val ma = f._4
    for ((k, v) <- ma) {
      arr += Array(f._1, f._2, f._3, k, v)
    }
}
The original data looks like this (one row of myDataset; there are hundreds of millions of rows):

val a = ("111","222","333",Map("k1"->"v1","k2"->"v2"))

Expected output:

("111","222","333","k1","v1")
("111","222","333","k2","v2")

But on large data this causes OOM problems. Is there another way to achieve this, or how can I optimize my code to avoid the OOM?

1 answer:

Answer 0 (score: 1)

You can simply explode the map column and then select the resulting key and value columns:

import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = sc.parallelize(Array(
    ("111","222","333",Map("k1"->"v1","k2"->"v2"))
)).toDF("a", "b", "c", "d")

df.select($"*", explode($"d"))
  .select("a", "b", "c", "key", "value")
  .as[(String, String, String, String, String)]
  .first
// (String, String, String, String, String) = (111,222,333,k1,v1)
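If you need to stay in the typed Dataset API rather than DataFrames, the same expansion can also be expressed with `Dataset.flatMap`, which keeps the work on the executors instead of collecting everything to the driver. Below is a minimal sketch of the per-row expansion function; the Spark call is shown only in a comment, and `myDataset` is assumed from the question:

```scala
// Expand one input row into one output tuple per map entry.
def expand(row: (String, String, String, Map[String, String]))
    : Seq[(String, String, String, String, String)] =
  row._4.toSeq.map { case (k, v) => (row._1, row._2, row._3, k, v) }

// On the Dataset this runs distributed, avoiding the driver-side collect:
//   val expanded = myDataset.flatMap(expand)
//   // expanded: Dataset[(String, String, String, String, String)]

// Local demonstration with the sample row from the question:
val sample = ("111", "222", "333", Map("k1" -> "v1", "k2" -> "v2"))
val out = expand(sample)
// out contains ("111","222","333","k1","v1") and ("111","222","333","k2","v2")
```

Either way, avoid materializing the full result with `collect` on hundreds of millions of rows; write the expanded Dataset out (or aggregate it) while it is still distributed.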