I am using Spark 1.5.2 and creating a DataFrame from Scala objects with the syntax below. My goal is to create data for unit tests.
class Address (first: String = null, second: String = null, zip: String = null){}
class Person (id: String = null, name: String = null, address: Seq[Address] = null){}
def test () = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  val persons = Seq(
    new Person(id = "1", name = "Salim",
      address = Seq(new Address(first = "1st street"))),
    new Person(name = "Sana",
      address = Seq(new Address(zip = "60088")))
  )
  // The code can't infer schema automatically
  val claimDF = sqlContext.createDataFrame(sc.parallelize(persons, 2), classOf[Person])
  claimDF.printSchema() // This prints "root" not the schema of Person.
}
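By contrast, if I convert Person and Address to case classes, Spark can infer the schema automatically, either with the syntax above, with sc.parallelize(persons, 2).toDF,
or with sqlContext.createDataFrame(sc.parallelize(persons, 2),StructType)
I cannot use case classes because, in Scala 2.10, a case class cannot hold more than 22 fields, and my classes have many fields. Using StructType is very inconvenient. Case classes are the most convenient option, but they cannot hold that many properties.
Thanks in advance for any help.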
Answer 0 (score: 1)
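Thank you very much for your input.
We eventually migrated to Spark 2.1 with Scala 2.11, which supports larger case classes, so this problem was resolved.
For Spark 1.6 and Scala 2.10, I ended up building Row objects and a StructType to construct the DataFrame.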
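import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// One Row per record; the values must line up with the schema fields below.
val rows = Seq(Row("data"))
val aRDD = sc.parallelize(rows)
val aDF = sqlContext.createDataFrame(aRDD, getSchema())

// Explicit schema: a single nullable string column.
def getSchema(): StructType = {
  StructType(
    Array(
      StructField("jobNumber", StringType, nullable = true))
  )
}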
Answer 1 (score: 0)
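Two changes to your code will make printSchema() emit the full structure of the DataFrame without using case classes.
First, as Daniel suggested, you need to have your classes extend the scala.Product trait (painful, but required for the .toDF
method below):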
class Address (first: String = null, second: String = null, zip: String = null) extends Product with Serializable
{
  override def canEqual(that: Any) = that.isInstanceOf[Address]
  override def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => first; case 1 => second; case 2 => zip
    // Reject indices outside 0 until productArity instead of failing with a MatchError.
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}
class Person (id: String = null, name: String = null, address: Seq[Address] = null) extends Product with Serializable
{
  override def canEqual(that: Any) = that.isInstanceOf[Person]
  override def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => id; case 1 => name; case 2 => address
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}
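To see what implementing Product gives you outside of Spark, here is a minimal, Spark-free sketch. It repeats the Address class from above and reads its fields back through productIterator, which the Product trait derives from productArity and productElement. The ProductDemo object and its fields helper are hypothetical names introduced only for this illustration.

```scala
// Same shape as the Address class above, repeated so the sketch compiles on its own.
class Address(first: String = null, second: String = null, zip: String = null)
    extends Product with Serializable {
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Address]
  override def productArity: Int = 3
  override def productElement(n: Int): Any = n match {
    case 0 => first
    case 1 => second
    case 2 => zip
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}

// Hypothetical helper object, not part of the original answer.
object ProductDemo {
  // productIterator walks productElement(0) until productElement(productArity - 1).
  def fields(p: Product): List[Any] = p.productIterator.toList

  def main(args: Array[String]): Unit = {
    val a = new Address(first = "1st street")
    println(fields(a)) // prints List(1st street, null, null)
  }
}
```

Unset constructor arguments show up as null, which is consistent with every column being reported as nullable in the schema.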
Second, create the DataFrame with the .toDF implicit method, which import sqlContext.implicits._ brings into scope, rather than with sqlContext.createDataFrame(..), like so:
val claimDF = sc.parallelize(persons, 2).toDF
Then claimDF.printSchema() will print:
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- address: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- second: string (nullable = true)
 |    |    |-- zip: string (nullable = true)
Alternatively, you can use Scala 2.11.0-M3 or later, which removes the 22-field limit on case classes.