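When I try to create a DataFrame with a schema in the code below, it does not work. Without the schema, all of the column data is merged into a single column.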
#transformations
val t3 = t1.map { case a => (a(1).toInt, a(2)) }
  .reduceByKey((x, y) => x + "," + y)
  .map { case (a, b) => parse(a, b) }
The parse function returns an Array[Int].
The code and its REPL output are shown here:
`t3.collect()`
res7: Array[Array[Int]] = Array(Array(100, 1, 1, 0, 0, 0, 2), Array(104,
2, 0, 0, 0, 1, 3))
#schema column names
`temp`
res11: List[String] = List(id, review, inprogress, notstarted, completed,
started, total)
`val fields = temp.map(fieldName => StructField(fieldName,
IntegerType, nullable = true))`
fields: List[org.apache.spark.sql.types.StructField]
#creating schema
`val schema = StructType(fields)`
org.apache.spark.sql.types.StructType
`val df = t3.toDF()`
org.apache.spark.sql.DataFrame = [value: array<int>]
`df.show()`
+--------------------+
| value|
+--------------------+
|[100, 1, 1, 0, 0,...|
|[104, 2, 0, 0, 0,...|
+--------------------+
`val df = t3.toDF(schema)`
error: type mismatch;
`val df = spark.createDataFrame(t3)`
<console>:35: error: overloaded method value createDataFrame with
alternatives
Expected:
+---+---------+----------+----------+------+-------+-----+
| id|completed|inprogress|notstarted|review|started|total|
+---+---------+----------+----------+------+-------+-----+
|100| 0| 1| 0| 1| 0| 2|
|104| 0| 0| 0| 2| 1| 3|
+---+---------+----------+----------+------+-------+-----+
Answer 0 (score: 0)
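According to the Spark documentation, the signature of toDF is
def toDF(colNames: String*): DataFrame
However, you are passing a StructType instance to toDF, which is why `t3.toDF(schema)` fails with a type mismatch.
Pass the column names instead, either as t3.toDF(temp: _*) or as toDF("id", .., "total").
In addition, the rows should be tuples, i.e. Array[(Int, .., Int)] rather than Array[Array[Int]]: toDF creates one column per tuple field, whereas each Array[Int] ends up in a single array<int> column.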
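A minimal sketch of the tuple approach described above, not the asker's actual pipeline. It assumes a local SparkSession (in spark-shell, `spark` already exists and the builder line can be dropped) and uses the two sample rows from the question as a stand-in for the real `t3`. The name `tuples` is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("toDF-sketch").getOrCreate()
import spark.implicits._ // provides toDF on RDDs of tuples

// Stand-in for the parsed RDD[Array[Int]] from the question.
val t3 = spark.sparkContext.parallelize(
  Seq(Array(100, 1, 1, 0, 0, 0, 2), Array(104, 2, 0, 0, 0, 1, 3)))

val temp = List("id", "review", "inprogress", "notstarted", "completed", "started", "total")

// Turn each fixed-length array into a 7-tuple so toDF sees seven columns.
// An array of any other length would raise a MatchError here.
val tuples = t3.map { case Array(a, b, c, d, e, f, g) => (a, b, c, d, e, f, g) }

// toDF takes the column names as varargs, hence the `: _*`.
val df = tuples.toDF(temp: _*)
df.show()
```

This only works when the number of columns is fixed and known at compile time. For a variable number of columns, the Row-based approach in the next answer is a better fit.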
Answer 1 (score: 0)
An RDD[Array[Int]] can be converted to an RDD[Row] and then to a DataFrame:
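import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Sample data standing in for the output of the parsing step
val parsedData = Array(Array(100, 1, 1, 0, 0, 0, 2), Array(104, 2, 0, 0, 0, 1, 3))
val rddAfterParsing = spark.sparkContext.parallelize(parsedData)
// Spread each array into a Row, one field per element
val rddOfRows = rddAfterParsing.map(arr => Row(arr: _*))
val columnNames = Seq("id", "review", "inprogress", "notstarted", "completed", "started", "total")
val fields = columnNames.map(fieldName => StructField(fieldName, IntegerType, nullable = true))
// createDataFrame accepts an RDD[Row] together with an explicit schema
val result = spark.createDataFrame(rddOfRows, StructType(fields))
result.show(false)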
Output:
+---+------+----------+----------+---------+-------+-----+
|id |review|inprogress|notstarted|completed|started|total|
+---+------+----------+----------+---------+-------+-----+
|100|1 |1 |0 |0 |0 |2 |
|104|2 |0 |0 |0 |1 |3 |
+---+------+----------+----------+---------+-------+-----+
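As an alternative that is not part of either answer, the single `value` column produced by `t3.toDF()` in the question can also be split into named columns by index. The sketch below assumes a local SparkSession and the same sample data. The names `single`, `projected`, and `byIndex` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("array-to-columns").getOrCreate()
import spark.implicits._

// Reproduce the one-column DataFrame from the question: value: array<int>
val single = spark.sparkContext
  .parallelize(Seq(Array(100, 1, 1, 0, 0, 0, 2), Array(104, 2, 0, 0, 0, 1, 3)))
  .toDF()

val columnNames = Seq("id", "review", "inprogress", "notstarted", "completed", "started", "total")

// Build one expression per name: element i of the array, aliased to that name.
val projected = columnNames.zipWithIndex.map { case (name, i) =>
  col("value").getItem(i).as(name)
}

val byIndex = single.select(projected: _*)
byIndex.printSchema()
byIndex.show(false)
```

This avoids going back to the RDD, at the cost of relying on every array having at least as many elements as there are column names.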