I am doing some row-level computation in Scala / Spark. I have a dataframe created from JSON -
{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc", "def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP", "timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}
You can create a dataframe directly from this JSON. My schema looks like this -
root
|-- available: boolean (nullable = true)
|-- createTime: string (nullable = true)
|-- dataValue: struct (nullable = true)
| |-- another_source: string (nullable = true)
| |-- another_source_array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- first: string (nullable = true)
| | | |-- last: string (nullable = true)
| |-- location: string (nullable = true)
| |-- names_source: struct (nullable = true)
| | |-- first_names: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- last_names_id: array (nullable = true)
| | | |-- element: long (containsNull = true)
| |-- timestamp: string (nullable = true)
|-- deleteTime: string (nullable = true)
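For reference, a minimal sketch of loading the sample record directly (assuming Spark 2.2+ with a SparkSession named spark; jsonString is just the record shown above) -
import spark.implicits._

// Hypothetical setup, not from the original post: load the single sample record into a DataFrame.
val jsonString = """{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc","def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP","timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}"""

val df = spark.read.json(Seq(jsonString).toDS())
df.printSchema()   // prints the schema shown above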
I am reading all the columns individually using a readSchema and writing them out with a writeSchema. Of the two complex columns, I am able to handle one but not the other.
Below is part of my read schema -
.add("names_source", StructType(
StructField("first_names", ArrayType.apply(StringType)) ::
StructField("last_names_id", ArrayType.apply(DoubleType)) ::
Nil
))
.add("another_source_array", ArrayType(StructType(
StructField("first", StringType) ::
StructField("last", StringType) ::
Nil
)))
And here is part of my write schema -
.add("names_source", StructType.apply(Seq(
StructField("first_names", StringType),
StructField("last_names_id", DoubleType))
))
.add("another_source_array", ArrayType(StructType.apply(Seq(
StructField("first", StringType),
StructField("last", StringType))
)))
During processing, I use a method that looks up all the columns by index. Below is my function code -
def myMapRedFunction(df: DataFrame, spark: SparkSession): DataFrame = {
  val columnIndex = dataIndexingSchema.fieldNames.zipWithIndex.toMap
  val myRDD = df.rdd
    .map(row => {
      Row(
        row.getAs[Boolean](columnIndex("available")),
        parseDate(row.getAs[String](columnIndex("create_time"))),
        ??I Need help here??
        row.getAs[String](columnIndex("another_source")),
        anotherSourceArrayFunction(row.getSeq[Row](columnIndex("another_source_array"))),
        row.getAs[String](columnIndex("location")),
        row.getAs[String](columnIndex("timestamp")),
        parseDate(row.getAs[String](columnIndex("delete_time")))
      )
    }).distinct
  spark.createDataFrame(myRDD, dataWriteSchema)
}
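parseDate is not defined anywhere in the post; a minimal stand-in, assuming the yyyy-MM-dd strings from the sample data and DateType columns in the write schema, could be -
import java.sql.Date

// Hypothetical helper, not from the original post: convert a "yyyy-MM-dd" string to
// java.sql.Date so the value matches a DateType column; nulls are passed through.
def parseDate(value: String): Date =
  if (value == null) null else Date.valueOf(value)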
The another_source_array column is handled by the anotherSourceArrayFunction method to make sure we get the schema as per the requirement. I need a similar function for the names_source column. Below is the function I am using for the another_source_array column.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

def anotherSourceArrayFunction(data: Seq[Row]): Seq[Row] = {
  if (data == null) {
    data
  } else {
    data.map(r => {
      val first = r.getAs[String]("first").toUpperCase()
      val last = r.getAs[String]("last")
      new GenericRowWithSchema(Array(first, last), StructType(
        StructField("first", StringType) ::
        StructField("last", StringType) ::
        Nil
      ))
    })
  }
}
In short, I need something like this, where I can keep the names_source column structure as a struct:
names_source:struct<first_names:array<string>,last_names_id:array<bigint>>
another_source_array:array<struct<first:string,last:string>>
The above are the column schemas I finally need. I am able to get another_source_array correctly; it is names_source that I need help with. I think my write schema for that column is correct, but I am not sure. Ultimately I need names_source:struct<first_names:array<string>,last_names_id:array<bigint>> as the column schema.
Note: I am able to get the another_source_array column without any trouble. I have kept that function here for better understanding.
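For completeness, a helper in the same style as anotherSourceArrayFunction, but for the names_source struct, might look like the sketch below (namesSourceFunction is a hypothetical name, not from the post, and it assumes last_names_id was read as long, matching the printed schema); the answer that follows shows this extra step is not strictly needed -
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

// Hypothetical sketch: re-wrap the nested struct so it carries the desired
// struct<first_names:array<string>,last_names_id:array<bigint>> schema.
def namesSourceFunction(data: Row): Row = {
  if (data == null) {
    data
  } else {
    val firstNames = data.getAs[Seq[String]]("first_names")
    val lastNamesId = data.getAs[Seq[Long]]("last_names_id")
    new GenericRowWithSchema(Array[Any](firstNames, lastNamesId), StructType(
      StructField("first_names", ArrayType(StringType)) ::
      StructField("last_names_id", ArrayType(LongType)) ::
      Nil
    ))
  }
}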
Answer 0 (score: 4)
From all the code you have tried, what I can see is that you are trying to flatten the struct column dataValue into separate columns.
If my assumption is correct, then you don't have to go through that complexity. You can simply do the following -
val myRDD = df.rdd
  .map(row => {
    Row(
      row.getAs[Boolean]("available"),
      parseDate(row.getAs[String]("createTime")),
      // nested fields are pulled straight out of the dataValue struct by name
      row.getAs[Row]("dataValue").getAs[Row]("names_source"),
      row.getAs[Row]("dataValue").getAs[String]("another_source"),
      row.getAs[Row]("dataValue").getAs[Seq[Row]]("another_source_array"),
      row.getAs[Row]("dataValue").getAs[String]("location"),
      row.getAs[Row]("dataValue").getAs[String]("timestamp"),
      parseDate(row.getAs[String]("deleteTime"))
    )
  }).distinct
import org.apache.spark.sql.types._

val dataWriteSchema = StructType(Seq(
  StructField("available", BooleanType, true),
  StructField("createTime", DateType, true),
  StructField("names_source", StructType(Seq(
    StructField("first_names", ArrayType(StringType), true),
    StructField("last_names_id", ArrayType(LongType), true)
  )), true),
  StructField("another_source", StringType, true),
  StructField("another_source_array", ArrayType(StructType(Seq(
    StructField("first", StringType),
    StructField("last", StringType)
  ))), true),
  StructField("location", StringType, true),
  StructField("timestamp", StringType, true),
  StructField("deleteTime", DateType, true)
))
spark.createDataFrame(myRDD, dataWriteSchema).show(false)
Alternatively, you can use .* on the struct column to get the elements of the struct column as separate columns -
import org.apache.spark.sql.functions._
df.select(col("available"), col("createTime"), col("dataValue.*"), col("deleteTime")).show(false)
With this approach you will have to change the string date columns to DateType yourself.
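A minimal sketch of that cast, assuming the yyyy-MM-dd format from the sample data, could be -
import org.apache.spark.sql.functions._

// Sketch: same flattening via select, with the string date columns cast to DateType using to_date.
df.select(
  col("available"),
  to_date(col("createTime")).as("createTime"),
  col("dataValue.*"),
  to_date(col("deleteTime")).as("deleteTime")
).show(false)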
In both cases you will get the output
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|available|createTime|names_source |another_source|another_source_array|location|timestamp |deleteTime|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|false |2016-01-08|[WrappedArray(abc, def),WrappedArray(123, 456)]|TableSources |[[1.1,ONE]] |GMP |2018-02-11|2016-01-08|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
I hope the answer is helpful.