Requirement
Expand a column in which a string holds a variable number of key=value pairs into multiple columns holding the corresponding values, using Spark's Scala API on data read from Parquet.
Example
Input
+----------+-------------------------------------+
| identity | Original |
+----------+-------------------------------------+
| 1 | key1=value1&key2=value2 |
+----------+-------------------------------------+
| 2 | key2=value2&key3=value3&key7=value7 |
+----------+-------------------------------------+
Output
+----------+-------------------------------------+--------+--------+--------+--------+
| identity | Original | key1 | key2 | key3 | key7 |
+----------+-------------------------------------+--------+--------+--------+--------+
| 1 | key1=value1&key2=value2 | value1 | value2 | | |
+----------+-------------------------------------+--------+--------+--------+--------+
| 2 | key2=value2&key3=value3&key7=value7 | | value2 | value3 | value7 |
+----------+-------------------------------------+--------+--------+--------+--------+
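For anyone who wants to follow along locally, here is a minimal sketch that rebuilds the sample input as a DataFrame (the column names identity and Original come from the example above; the real data would come from Parquet, and sample is just a stand-in name):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Mirror the two rows of the example input table
val sample = Seq(
  (1, "key1=value1&key2=value2"),
  (2, "key2=value2&key3=value3&key7=value7")
).toDF("identity", "Original")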
My progress
By reading several posts here, I have made my way to the last step. Below are my efforts toward the requirement:
Step 1. Get all the keys from Original
import spark.implicits._

val base = spark.read.parquet("path.of.parquet")
val aggregationKeys = base.select($"Original").rdd.map {
  case observation =>
    // Row.toString wraps the value in brackets, e.g. "[key1=value1&key2=value2]"
    val immediate = observation.toString.replaceAll("[\\[\\]]", "").split("&")
    immediate.map(_.split("=")(0)) // keep only the key of each pair
}.collect.flatMap(y => y).sorted.distinct
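As an aside, here is a sketch of an alternative that skips the Row.toString round trip (and hence the bracket stripping) by reading the column as a typed Dataset[String]; the name aggregationKeysAlt is mine:
import spark.implicits._

val aggregationKeysAlt = base.select($"Original").as[String]
  .flatMap(_.split("&").map(_.split("=")(0))) // keep only the key of each pair
  .distinct()
  .collect()
  .sorted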
Step 2. Create new columns according to the keys
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
val aggregationKeysString = aggregationKeys.mkString(",")
val keysFields = aggregationKeysString.split(",")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val keysSchema = StructType(keysFields)
// keysColumns has no rows, so the left outer join below only attaches
// the key columns to base, all filled with null
val keysColumns = spark.createDataFrame(
  spark.sparkContext.emptyRDD[Row], keysSchema
).withColumn("identity", lit(0))
val transformedBase = base.join(keysColumns, Seq("identity"), "left_outer")
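If it helps, the empty-frame join can probably be replaced by attaching null columns directly; a sketch of that simpler variant, reusing aggregationKeys from step 1 (transformedBaseAlt is a made-up name):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Add one null String column per key, without any join
val transformedBaseAlt = aggregationKeys.foldLeft(base) {
  (df, key) => df.withColumn(key, lit(null).cast(StringType))
}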
Step 3. [Struggling] Some Scala code that is logically equivalent to: if a key exists in Original, its value becomes the content of the corresponding column from step 2, as the output in Example shows. My idea is to obtain the set of key-value pairs on each row of Original and then pass the values into the corresponding columns.
How can I achieve the goal of step 3? And with performance in mind, is there a better solution to this requirement? The number of keys may reach several hundred.
Answer 0 (score: 0)
Right after I posted, I found a related question on this site that resolved exactly the point where I was stuck.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import spark.implicits._

val base = spark.read.parquet("path.of.parquet")
// Get all the keys in the form of an array
val aggregationKeys = base.select($"Original").rdd.map {
  case observation =>
    val immediate = observation.toString.replaceAll("[\\[\\]]", "").split("&")
    immediate.map(_.split("=")(0))
}.collect.flatMap(y => y).sorted.distinct
// Turn the array of keys into column names, and build a UDF that
// parses the Original string into a Map(key -> value)
val columnNames: Seq[String] = aggregationKeys.filter(_.nonEmpty).toSeq
val mapifyValue = udf[Map[String, String], String] {
  s => s.split("&").map(_.split("=")).map {
    case Array(k, v) => k -> v
  }.toMap
}
// Get the result as the output of Example portrays
// (base is the frame read from Parquet above)
val stringAsMap = base.withColumn("mapifiedOriginal", mapifyValue($"Original"))
val ultimateResult: DataFrame = columnNames.foldLeft(stringAsMap) {
  case (df, colName) =>
    df.withColumn(colName, $"mapifiedOriginal".getItem(colName))
}.drop("mapifiedOriginal")
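On the performance concern from the question: with hundreds of keys, the foldLeft above chains hundreds of withColumn calls, each adding another projection for Catalyst to analyze, which can get slow. Here is a sketch that builds all key columns in a single select instead, reusing columnNames and mapifyValue from above (ultimateResultAlt is a made-up name):
import org.apache.spark.sql.functions.col
import spark.implicits._

// One column expression per key, all applied in a single projection
val keyColumns = columnNames.map(name => $"mapifiedOriginal".getItem(name).as(name))
val ultimateResultAlt = base
  .withColumn("mapifiedOriginal", mapifyValue($"Original"))
  .select((col("*") +: keyColumns): _*)
  .drop("mapifiedOriginal")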
C'est la vie.