Expand a string of key=value pairs into the corresponding columns

Date: 2017-07-24 05:13:15

Tags: scala apache-spark apache-spark-sql data-manipulation

Requirement

Expand a column that stores a variable-length string of key=value pairs into multiple columns holding the corresponding values, using Spark's Scala API on data read from Parquet.

Example

Input

+----------+-------------------------------------+
| identity | Original                            |
+----------+-------------------------------------+
| 1        | key1=value1&key2=value2             |
+----------+-------------------------------------+
| 2        | key2=value2&key3=value3&key7=value7 |
+----------+-------------------------------------+

Output

+----------+-------------------------------------+--------+--------+--------+--------+
| identity | Original                            | key1   | key2   | key3   | key7   |
+----------+-------------------------------------+--------+--------+--------+--------+
| 1        | key1=value1&key2=value2             | value1 | value2 |        |        |
+----------+-------------------------------------+--------+--------+--------+--------+
| 2        | key2=value2&key3=value3&key7=value7 |        | value2 | value3 | value7 |
+----------+-------------------------------------+--------+--------+--------+--------+
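
For local experimentation, the example input can be rebuilt directly instead of reading Parquet. A minimal sketch (the local session setup and the toDF call here are my own assumptions, not part of the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kv-expansion").getOrCreate()
import spark.implicits._

//Rebuild the example input shown above
val base = Seq(
  (1, "key1=value1&key2=value2"),
  (2, "key2=value2&key3=value3&key7=value7")
).toDF("identity", "Original")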

My progress

By reading several posts here, I have gotten most of the way. Below is my attempt to meet the requirement:

  1. Get the aggregation of keys from Original (a slightly more robust variant is sketched after this list)

    val base = spark.read.parquet("path.of.parquet")
    val aggregationKeys = base.select($"Original").rdd.map{
      case observation => {
        val immediate = observation.toString.replaceAll("[\\[\\]]", "").split("&")
        immediate.map(_.split("=")(0))
      }
    }.collect.flatMap(y => y).sorted.distinct
    
  2. Create new columns based on the keys

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.lit
    val aggregationKeysString = aggregationKeys.mkString(",")
    val keysFields = aggregationKeysString.split(",")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))

    val keysSchema = StructType(keysFields)
    val keysColumns = spark.createDataFrame(
      spark.sparkContext.emptyRDD[Row], keysSchema
      ).withColumn("identity", lit(0))
    val transformedBase = base.join(keysColumns, Seq("identity"), "left_outer")
    
  3. [Struggling] Some Scala code that is logically equivalent to: if a key exists in Original, its value becomes the content of the corresponding column created in step 2, as the Example output shows. My idea is to obtain the set of key-value pairs for each row of Original and then pass the values into the corresponding columns.

  4. How can step 3 be achieved? And, with performance in mind, is there a better solution for this requirement, given that the number of keys may reach several hundred?
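
As a side note on step 1, a slightly more robust sketch reads the string with getString(0) instead of cleaning up Row.toString, assuming Original is a non-null string column on the base DataFrame loaded above:

//Assumption: Original is a non-null string column; getString(0) returns the raw value,
//so no "[...]" cleanup is needed
val aggregationKeys = base.select($"Original").rdd
  .flatMap(row => row.getString(0).split("&").map(_.split("=")(0)))
  .distinct()   //de-duplicate on the executors before collecting to the driver
  .collect()
  .sorted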

1 Answer:

Answer 0 (score: 0)

After I posted, I found a related question on this site that solved the problem I was stuck on.

Solution

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import spark.implicits._   //for the $"colName" syntax

val base = spark.read.parquet("path.of.parquet")

//Collect every distinct key across the Original column as a sorted array
val aggregationKeys = base.select($"Original").rdd.map{
  case observation => {
    val immediate = observation.toString.replaceAll("[\\[\\]]", "").split("&")
    immediate.map(_.split("=")(0))
  }
}.collect.flatMap(y=>y).sorted.distinct

//Keep the non-empty key names and define a UDF that parses Original into a Map
val columnNames: Seq[String] = aggregationKeys.filter(_.nonEmpty).toSeq
val mapifyValue = udf[Map[String, String], String] {
    s => s.split("&").map(_.split("=")).map{
        case Array(k, v) => k -> v
    }.toMap
}

//Extract each key's value into its own column, as the output of Example portrays
val stringAsMap = base.withColumn("mapifiedOriginal", mapifyValue($"Original"))
val ultimateResult: DataFrame = columnNames.foldLeft(stringAsMap) {
    case (df, colName) => 
            df.withColumn(colName, $"mapifiedOriginal".getItem(colName))
}.drop("mapifiedOriginal")
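
Regarding the performance concern in step 4: a possible alternative sketch (my own assumption, not taken from the linked question) is to let Spark's built-in str_to_map function do the parsing and to project all key columns in a single select, avoiding both the UDF and hundreds of chained withColumn calls:

import org.apache.spark.sql.functions.{col, expr}

//Parse Original into a MapType column with the built-in str_to_map,
//then extract every key in one select instead of a foldLeft of withColumn calls
val withMap = base.withColumn("kv", expr("str_to_map(Original, '&', '=')"))
val keyColumns = columnNames.map(k => col("kv").getItem(k).as(k))
val alternativeResult = withMap.select(col("identity") +: col("Original") +: keyColumns: _*)

getItem on a missing key yields null, which matches the blank cells in the Example output.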

C'est la vie.