Unable to create a df with the specified schema

Time: 2019-05-14 06:12:39

Tags: scala apache-spark apache-spark-sql rdd

When I try to create a dataframe with a schema in the code below, it does not work; without a schema, all the column data is merged into a single column.

#transformations
val t3 = t1.map { case a => (a(1).toInt, a(2)) }
  .reduceByKey((x, y) => x + "," + y)
  .map { case (a, b) => parse(a, b) }

The parse function returns Array[Int].
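(The question does not show parse itself. A hypothetical reconstruction, purely for illustration and consistent with the sample output below, might count status occurrences in the comma-joined string:)

```scala
// Hypothetical sketch of parse: not the asker's actual code.
// Counts occurrences of each status in the comma-joined string
// and returns them alongside the id and the total item count.
def parse(id: Int, statuses: String): Array[Int] = {
  val items = statuses.split(",")
  val count = (s: String) => items.count(_ == s)
  Array(id, count("review"), count("inprogress"), count("notstarted"),
        count("completed"), count("started"), items.length)
}
```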

The output is shown here:

`t3.collect()`
res7: Array[Array[Int]] = Array(Array(100, 1, 1, 0, 0, 0, 2), Array(104, 2, 0, 0, 0, 1, 3))
#schema column names
`temp`
res11: List[String] = List(id, review, inprogress, notstarted, completed, started, total)

`val fields = temp.map(fieldName => StructField(fieldName, IntegerType, nullable = true))`
fields: List[org.apache.spark.sql.types.StructField]
#creating schema
`val schema = StructType(fields)`  
org.apache.spark.sql.types.StructType  

`val df = t3.toDF()`  
org.apache.spark.sql.DataFrame = [value: array<int>]  

`df.show()`  
+--------------------+  
|               value|  
+--------------------+  
|[100, 1, 1, 0, 0,...|  
|[104, 2, 0, 0, 0,...|  
+--------------------+  

`val df = t3.toDF(schema)`  
error: type mismatch;  


`val df = spark.createDataFrame(t3)`  
<console>:35: error: overloaded method value createDataFrame with alternatives

Expected:  
+---+---------+----------+----------+------+-------+-----+  
| id|completed|inprogress|notstarted|review|started|total|  
+---+---------+----------+----------+------+-------+-----+  
|100|        0|         1|         0|     1|      0|    2|  
|104|        0|         0|         0|     2|      1|    3|  
+---+---------+----------+----------+------+-------+-----+

2 Answers:

Answer 0 (score: 0)

From the Spark documentation you have:

def toDF(colNames: String*): DataFrame

However, you are passing a StructType instance to the toDF function.

You can create the DataFrame with `t3.toDF(temp: _*)` (converting the list of names to varargs), or with the second form, `toDF("id", .., "total")`.

Further, you should use `Array[(Int, .., Int)]` instead of `Array[Array[Int]]`.
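As a minimal sketch of that tuple-based approach (assuming seven fields per row, so each element is a seven-element tuple rather than an Array[Int], which gives toDF the arity and types it needs):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("toDF-example").getOrCreate()
import spark.implicits._

// Parsed rows as tuples instead of Array[Int] (illustrative data).
val parsed = Seq(
  (100, 1, 1, 0, 0, 0, 2),
  (104, 2, 0, 0, 0, 1, 3)
)
val rdd = spark.sparkContext.parallelize(parsed)

val temp = List("id", "review", "inprogress", "notstarted", "completed", "started", "total")

// toDF(colNames: String*) accepts the column names as varargs:
val df = rdd.toDF(temp: _*)
df.show()
```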

Answer 1 (score: 0)

An RDD[Array[Int]] holding the parsed data can be converted to an RDD[Row] and then to a DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val parsedData = Array(Array(100, 1, 1, 0, 0, 0, 2), Array(104, 2, 0, 0, 0, 1, 3))
val rddAfterParsing = sparkContext.parallelize(parsedData)
val rddOfRows = rddAfterParsing.map(arr => Row(arr: _*))

val columnNames = Seq("id", "review", "inprogress", "notstarted", "completed", "started", "total")
val fields = columnNames.map(fieldName => StructField(fieldName,
  IntegerType, nullable = true))
val result = spark.createDataFrame(rddOfRows, StructType(fields))

result.show(false)

Output:

+---+------+----------+----------+---------+-------+-----+
|id |review|inprogress|notstarted|completed|started|total|
+---+------+----------+----------+---------+-------+-----+
|100|1     |1         |0         |0        |0      |2    |
|104|2     |0         |0         |0        |1      |3    |
+---+------+----------+----------+---------+-------+-----+
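If the column order in the question's expected output matters (id, completed, inprogress, ...), the DataFrame built above can be reordered with select; a small sketch, assuming the `result` DataFrame from the answer:

```scala
// select accepts column names as varargs and returns a DataFrame
// with the columns in the requested order.
val reordered = result.select("id", "completed", "inprogress",
  "notstarted", "review", "started", "total")
reordered.show(false)
```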