Error creating DataFrame: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string

Date: 2017-03-15 09:19:40

Tags: scala apache-spark dataframe apache-spark-sql

I created a schema with the following code:

val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)

Then I created an RDD:
val data = spark.sparkContext.textFile("cities.txt")

And converted it to an RDD of Row to apply the schema:

val cities = data.map(line => line.split(";")).map(row => Row.fromSeq(row.zip(schema.toSeq)))

val citiesRDD = spark.sqlContext.createDataFrame(cities, schema)

This gives me the error:

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of string

1 answer:

Answer 0 (score: 1)

You don't need the schema to create a Row; the schema is needed when creating the DataFrame. You also need some logic to convert the split line (which yields 3 strings) into integers:
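To see why the original code fails: zipping the split values with the schema fields produces Tuple2 elements inside the Row, and Spark then tries to encode a tuple where a plain string is expected. A minimal sketch of the zip step (using field names as a stand-in for the actual StructField objects):

```scala
// What the question's map actually builds: each Row element becomes a
// (value, field) pair rather than the value itself.
val values = "Bern;10;12".split(";")
val fields = Seq("city", "female", "male") // stand-in for schema.toSeq

val zipped = values.zip(fields)
// Each element of `zipped` is a scala.Tuple2, e.g. ("Bern", "city"),
// which Spark rejects for a StringType column — hence the error.
```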

Here is a minimal solution without exception handling:

val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data

val schema = new StructType().add("city", StringType, true).add("female", IntegerType, true).add("male", IntegerType, true)

val cities = data.map(line => {
  val Array(city, female, male) = line.split(";")
  Row(city, female.toInt, male.toInt)
})

val citiesDF = sqlContext.createDataFrame(cities, schema)
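Since the answer explicitly omits exception handling, a malformed line (wrong number of fields, or a non-numeric count) would fail the whole job at runtime. One way to harden the parsing step is a helper (hypothetical, not part of the answer) that uses scala.util.Try and returns an Option, so bad lines can be dropped with flatMap:

```scala
import scala.util.Try

// Hypothetical helper: parse one "city;female;male" line, returning
// None for malformed rows instead of throwing at runtime.
def parseLine(line: String): Option[(String, Int, Int)] =
  line.split(";") match {
    case Array(city, female, male) =>
      Try((city, female.toInt, male.toInt)).toOption
    case _ => None
  }

// Usage sketch: malformed lines are silently dropped instead of
// failing the whole Spark job.
// data.flatMap(parseLine).map { case (c, f, m) => Row(c, f, m) }
```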

I usually use case classes to create DataFrames, because Spark can infer the schema from the case class:

// "schema" for dataframe, define outside of main method
case class MyRow(city:Option[String],female:Option[Int],male:Option[Int]) 

val data = sc.parallelize(Seq("Bern;10;12")) // mock for real data

import sqlContext.implicits._

val citiesDF = data.map(line => {
  val Array(city, female, male) = line.split(";")
  MyRow(Some(city), Some(female.toInt), Some(male.toInt))
}).toDF()
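The Option fields in the case class are what make the inferred columns nullable, mirroring the `true` flags in the manual schema. The parsing step itself needs no Spark at all, so it can be checked in isolation (a sketch reusing the answer's MyRow):

```scala
// The answer's case class: Option fields become nullable columns.
case class MyRow(city: Option[String], female: Option[Int], male: Option[Int])

// The mapping step from the answer, extracted as a plain function so it
// can be tested without a SparkContext.
def toMyRow(line: String): MyRow = {
  val Array(city, female, male) = line.split(";")
  MyRow(Some(city), Some(female.toInt), Some(male.toInt))
}
```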