I am trying to create a DataFrame from a nested RDD. Normally I would just use the toDF()
method, but my RDD contains a case class with more than 100 fields (I am on Scala 2.10, where case classes are capped at 22 fields, hence the split below), structured like this:
case class User(
  val user_id: String = "",
  val user_name: String = ""
) extends UserExtended

class UserExtended extends Serializable {
  val user_adress: Option[String] = Some("")
  // ... 100 more fields
  val cards: Array[Cards] = Array[Cards]()
}
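For illustration, this is roughly what toDF() gives me: only the constructor parameters of the case class end up in the schema, and everything declared in UserExtended is silently dropped (output reconstructed from memory, assuming an RDD[User] named users and a sqlContext in scope):

import sqlContext.implicits._

val df = users.toDF()
df.printSchema()
// root
//  |-- user_id: string (nullable = true)
//  |-- user_name: string (nullable = true)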
That makes toDF() useless here, because it does not pick up the fields defined in the inherited class. To build my DataFrame anyway, I followed the documentation on programmatically specifying the schema and ended up with this code:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

def createDataframe(users: RDD[User])(implicit sqlContext: SQLContext) = {
  val userInfos = users.map { user =>
    val buffer = ArrayBuffer.empty[Any]
    buffer.append(user.idUser)
    // ... contains more than 100 fields
    buffer.append(user.userTrophies)
    // nested cards are converted to Rows so they match the nested StructType below
    val cards = user.cards.map { card =>
      val cardBuffer = ArrayBuffer.empty[Any]
      cardBuffer.append(card.cardName)
      cardBuffer.append(card.dps)
      Row.fromSeq(cardBuffer)
    }.toSeq
    buffer.append(cards)
    Row.fromSeq(buffer)
  }

  val schema = StructType(Seq(
    StructField("id_user", StringType, false),
    StructField("user_trophies", StringType, false),
    StructField("cards", ArrayType(StructType(Seq(
      StructField("card_name", StringType, false),
      StructField("dps", StringType, false)
    ))))
  ))

  sqlContext.createDataFrame(userInfos, schema)
}
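For completeness, the unit test drives this function roughly as follows (sc and the small usersRdd: RDD[User] fixture are built elsewhere in the test; the names here are illustrative):

implicit val sqlContext = new SQLContext(sc)  // sc is the test's SparkContext
val df = createDataframe(usersRdd)            // usersRdd is a tiny RDD[User] with a couple of cards
df.show()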
Unfortunately, when I run my unit tests I get this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
scala.MatchError: WrappedArray([Gobelin,100], [Giant,500]) (of class scala.collection.mutable.WrappedArray$ofRef)
This is strange, because when I call toDF() on a nested RDD created in the shell, the nested column also comes back as a WrappedArray.
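That shell experiment looked roughly like this (simplified, self-contained case classes used only for the check, not the real ones above):

import sqlContext.implicits._

case class ShellCard(cardName: String, dps: String)
case class ShellUser(idUser: String, cards: Array[ShellCard])

val shellDf = sc.parallelize(Seq(
  ShellUser("1", Array(ShellCard("Gobelin", "100"), ShellCard("Giant", "500")))
)).toDF()

shellDf.collect()
// res0: Array[org.apache.spark.sql.Row] = Array([1,WrappedArray([Gobelin,100], [Giant,500])])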
I have spent several days trying to find the bug in my code and I am still stuck, so I would really appreciate any help.
Thanks.