Using N elements

Date: 2017-09-22 19:47:26

Tags: scala apache-spark apache-spark-sql

How can the data be converted into a DataFrame using the schema details given in schemanames together with input5?

The conversion should be dynamic, without hard-coding Row(r(0), r(1)): the number of columns can grow or shrink in both the input and the schema, so the code should handle any count.

case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])

val input5 = List(Entry("a","b",0,Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")))

val schemanames = "col1,ref"

The target DataFrame should be built only from the Map in input5 (i.e. col1, ref). There can be many other columns (e.g. col2, col3, ...); whenever the Map contains more columns, the same columns will be listed in schemanames.

The schemanames variable should be used to create the struct, and input5.row (the Map) should be the data source, since schemanames may list as many as 100 columns, and the same applies to the data in input5.row.

2 Answers:

Answer 0 (score: 0):

This works for any number of columns, as long as they are all Strings and every Entry contains a Map with values for all of those columns:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// split schemanames into the column names:
val columns = schemanames.split(",")

// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, StringType)))

// convert input5 to Seq[Row], while selecting the values from "row" Map in same order of columns
val rows = input5.map(_.row)
  .map(valueMap => columns.map(valueMap.apply).toSeq)
  .map(Row.fromSeq)

// finally - create dataframe
val dataframe = spark.createDataFrame(sc.parallelize(rows), schema)
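One caveat about the snippet above: `columns.map(valueMap.apply)` throws a `NoSuchElementException` as soon as one Entry's Map lacks one of the schema columns. A minimal plain-Scala sketch of the extraction step (no Spark session needed), using `getOrElse` with an assumed empty-string default for absent keys:

```scala
// Schema columns, in the order they should appear in the DataFrame
val columns = Seq("col1", "col2", "ref")

// One entry's Map; note that "col2" is absent here
val valueMap = Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")

// getOrElse supplies a default instead of throwing on the missing key
val values = columns.map(k => valueMap.getOrElse(k, ""))
// values: Seq("0000555", "", "2017-08-12 12:12:12.266528")
```

The empty-string default is just an illustration; any sentinel (or dropping the row) would work, depending on how missing data should be represented.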

Answer 1 (score: 0):

You can iterate over the entries in schemanames (which, per your description, are the keys to pick from the Map) and assemble the DataFrame with a UDF that looks the values up in the Map, as shown below:

case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])

val input5 = List(
  Entry("a", "b", 0, Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")),
  Entry("c", "b", 1, Map("col1" -> "0000444", "col2" -> "0000444", "ref" -> "2017-08-14 14:14:14.0")),
  Entry("a", "d", 0, Map("col2" -> "0000666", "ref" -> "2017-08-16 16:16:16.0")),
  Entry("e", "f", 0, Map("col1" -> "0000777", "ref" -> "2017-08-17 17:17:17.0", "others" -> "x"))
)  

val schemanames = "col1, ref"

// Create a DataFrame from input5 (assumes a SparkSession named spark)
import spark.implicits._
val df = input5.toDF

// A UDF to get a column value from the Map, defaulting to "n/a"
import org.apache.spark.sql.functions.{col, udf}
def getColVal(c: String) = udf(
  (m: Map[String, String]) => m.getOrElse(c, "n/a")
)

// Add one column per entry in schemanames
val cols = schemanames.split(",").map(_.trim)
val df2 = cols.foldLeft(df)(
  (acc, c) => acc.withColumn(c, getColVal(c)(acc("row")))
)

// Keep only the schema columns, in order
val df3 = df2.select(cols.map(col): _*)

df3.show(truncate=false)
+-------+--------------------------+
|col1   |ref                       |
+-------+--------------------------+
|0000555|2017-08-12 12:12:12.266528|
|0000444|2017-08-14 14:14:14.0     |
|n/a    |2017-08-16 16:16:16.0     |
|0000777|2017-08-17 17:17:17.0     |
+-------+--------------------------+
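The `foldLeft` in this answer adds one column per schema name, however many there are. The accumulation pattern can be seen in plain Scala, with a hypothetical Map standing in for one DataFrame row:

```scala
val schemanames = "col1, ref"
val cols = schemanames.split(",").map(_.trim)

// One row's Map, as in input5; "others" is not in the schema
val row = Map("col1" -> "0000555",
              "ref" -> "2017-08-12 12:12:12.266528",
              "others" -> "x")

// Each fold step adds one extracted column, mirroring withColumn per name
val extracted = cols.foldLeft(Map.empty[String, String])(
  (acc, c) => acc + (c -> row.getOrElse(c, "n/a"))
)
// extracted: Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")
```

Keys absent from schemanames (like "others") never enter the result, which is why the final select only ever sees the schema columns.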