Creating a Spark DataFrame from a CSV column of arbitrary length

Asked: 2017-05-08 20:34:32

Tags: scala apache-spark

I'm trying to create a new DataFrame from a single CSV-formatted column of an existing DataFrame. I don't know the schema in advance, so I'm trying to use the spark.createDataFrame method without a schema argument (similar to method 1 in this example).

I'm trying the following code, but it fails with an error:

var csvrdd = df.select(df("Body").cast("string")).rdd.map{x:Row => x.getAs[String](0)}.map(x => x.split(",").toSeq)
var dfWithoutSchema = spark.createDataFrame(csvrdd)

The error:

error: overloaded method value createDataFrame with alternatives:
  [A <: Product](data: Seq[A])(implicit evidence$3: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame <and>
  [A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.rdd.RDD[Seq[String]])
       var dfWithoutSchema = spark.createDataFrame(csvrdd)

1 Answer:

Answer 0 (Score: 1)

First, the reason for the failure is clear from the signature of createDataFrame:

def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame

The type A is bounded to be a subclass of scala.Product. Your RDD contains Seq[String], which is not such a subclass. If you really want to, you can artificially wrap the array in a Tuple1 (which does extend Product) to make it work:

val csvrdd: RDD[Tuple1[Array[String]]] = df
  .select(df("Body").cast("string"))
  .rdd
  .map{ x:Row => x.getAs[String](0)}
  .map(x => Tuple1(x.split(","))) // wrapping with a Tuple1, which extends scala.Product

val dfWithoutSchema = spark.createDataFrame(csvrdd) // now this overload works

dfWithoutSchema.printSchema()
// root
// |-- _1: array (nullable = true)
// |    |-- element: string (containsNull = true)
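
To actually use that wrapped column you'd have to reach into the auto-named _1 field; for example (a hypothetical usage sketch, not part of the original answer):

// pull the first CSV part out of each wrapped array;
// requires import spark.implicits._ for the $ syntax
dfWithoutSchema.select($"_1".getItem(0) as "firstPart").show()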

However, this doesn't seem very useful. It creates a DataFrame with a single column of ArrayType, something that can be achieved much more simply with the split function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.split

val withArray = df.select(split(df("Body").cast("string"), ",") as "arr")

withArray.printSchema()
// root
//  |-- arr: array (nullable = true)
//  |    |-- element: string (containsNull = true)
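
A quick withArray.show() makes the remaining problem visible: the arrays can differ in length from row to row. The output below is a sketch, assuming two rows whose Body values split into 3 and 2 parts respectively:

withArray.show()
// +---------+
// |      arr|
// +---------+
// |[a, b, c]|
// |   [1, 2]|
// +---------+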

Alternatively, if what you're hoping for is a DataFrame with a separate column per "CSV column", you'll have to "decide" on a common schema for all the records (not all records necessarily have the same number of "CSV parts"). You can do that by adding another scan over the DataFrame to compute the maximum number of columns required, and then let Spark "fill in the blanks" with nulls wherever an actual value has fewer parts:

// imports for split, size, and the $-column syntax
import org.apache.spark.sql.functions.{split, size}
import spark.implicits._

// first - split the String into an array of Strings
val withArray = df.select(split(df("Body").cast("string"), ",") as "arr")

// optional - calculate the *maximum* number of columns;
// If you know it to begin with (e.g. "CSV cannot exceed X columns") -
// you can skip this and use that known value
val maxLength: Int = withArray.select(size($"arr") as "size")
  .groupBy().max("size")
  .first().getAs[Int](0)

// Create the individual columns, with nulls where the arrays were shorter than maxLength
val columns = (0 until maxLength).map(i => $"arr".getItem(i) as s"col$i")

// select these columns
val result = withArray.select(columns: _*)

result.printSchema() // in my example, maxLength = 4
// root
//  |-- col0: string (nullable = true)
//  |-- col1: string (nullable = true)
//  |-- col2: string (nullable = true)
//  |-- col3: string (nullable = true)
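
Putting it all together, here is a minimal end-to-end sketch. The input DataFrame, its Body values, and the shown output are illustrative assumptions, not from the original question (in particular, Body is already a string here, so the cast is unnecessary):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{split, size}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical input: a single "Body" column holding CSV lines of varying length
val df = Seq("a,b,c", "1,2", "w,x,y,z").toDF("Body")

val withArray = df.select(split($"Body", ",") as "arr")

val maxLength = withArray.select(size($"arr") as "size")
  .groupBy().max("size")
  .first().getAs[Int](0) // 4 for this sample data

val result = withArray.select((0 until maxLength).map(i => $"arr".getItem(i) as s"col$i"): _*)

result.show()
// +----+----+----+----+
// |col0|col1|col2|col3|
// +----+----+----+----+
// |   a|   b|   c|null|
// |   1|   2|null|null|
// |   w|   x|   y|   z|
// +----+----+----+----+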