Creating a Spark DataFrame from a CSV column of arbitrary length

Asked: 2017-05-08 20:34:32

Tags: scala apache-spark

I'm trying to create a new DataFrame from a single CSV-formatted column of an existing DataFrame. I don't know the schema in advance, so I'm trying to use the spark.createDataFrame method without a schema argument (similar to method 1 in this example).

I'm trying the following code, but it fails with an error:

var csvrdd = df.select(df("Body").cast("string")).rdd.map{x:Row => x.getAs[String](0)}.map(x => x.split(",").toSeq)
var dfWithoutSchema = spark.createDataFrame(csvrdd)

The error:

error: overloaded method value createDataFrame with alternatives:
  [A <: Product](data: Seq[A])(implicit evidence$3: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame <and>
  [A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.rdd.RDD[Seq[String]])
       var dfWithoutSchema = spark.createDataFrame(csvrdd)

1 Answer:

Answer 0 (Score: 1)

First, the reason for the failure is clear from the signature of createDataFrame:

def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame

The type A is bounded to be a subclass of scala.Product. Your RDD contains Seq[String], which is not such a subclass. If you really want to, you can artificially wrap the array in a Tuple1 (which does extend Product) to make it work:

val csvrdd: RDD[Tuple1[Array[String]]] = df
  .select(df("Body").cast("string"))
  .rdd
  .map{ x:Row => x.getAs[String](0)}
  .map(x => Tuple1(x.split(","))) // wrapping with a Tuple1, which extends scala.Product

val dfWithoutSchema = spark.createDataFrame(csvrdd) // now this overload works

dfWithoutSchema.printSchema()
// root
// |-- _1: array (nullable = true)
// |    |-- element: string (containsNull = true)
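
To actually use that wrapped column you'd have to reach into the auto-named _1 field; for example (a hypothetical usage sketch, not part of the original answer):

// pull the first CSV part out of each wrapped array;
// requires import spark.implicits._ for the $ syntax
dfWithoutSchema.select($"_1".getItem(0) as "firstPart").show()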

However, this doesn't seem very useful. It creates a DataFrame with a single column of ArrayType, something that can be achieved much more simply with the split function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.split

val withArray = df.select(split(df("Body").cast("string"), ",") as "arr")

withArray.printSchema()
// root
//  |-- arr: array (nullable = true)
//  |    |-- element: string (containsNull = true)
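
A quick withArray.show() makes the remaining problem visible: the arrays can differ in length from row to row. The output below is a sketch, assuming two rows whose Body values split into 3 and 2 parts respectively:

withArray.show()
// +---------+
// |      arr|
// +---------+
// |[a, b, c]|
// |   [1, 2]|
// +---------+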

Alternatively, if what you're hoping for is a DataFrame with a separate column per "CSV column", you'll have to "decide" on a common schema for all the records (not all records necessarily have the same number of "CSV parts"). You can do that by adding another scan over the DataFrame to compute the maximum number of columns required, and then let Spark "fill in the blanks" with nulls wherever an actual value has fewer parts:

// imports for split, size, and the $-column syntax
import org.apache.spark.sql.functions.{split, size}
import spark.implicits._

// first - split the String into an array of Strings
val withArray = df.select(split(df("Body").cast("string"), ",") as "arr")

// optional - calculate the *maximum* number of columns;
// If you know it to begin with (e.g. "CSV cannot exceed X columns") -
// you can skip this and use that known value
val maxLength: Int = withArray.select(size($"arr") as "size")
  .groupBy().max("size")
  .first().getAs[Int](0)

// Create the individual columns, with nulls where the arrays were shorter than maxLength
val columns = (0 until maxLength).map(i => $"arr".getItem(i) as s"col$i")

// select these columns
val result = withArray.select(columns: _*)

result.printSchema() // in my example, maxLength = 4
// root
//  |-- col0: string (nullable = true)
//  |-- col1: string (nullable = true)
//  |-- col2: string (nullable = true)
//  |-- col3: string (nullable = true)
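
Putting it all together, here is a minimal end-to-end sketch. The input DataFrame, its Body values, and the shown output are illustrative assumptions, not from the original question (in particular, Body is already a string here, so the cast is unnecessary):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{split, size}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical input: a single "Body" column holding CSV lines of varying length
val df = Seq("a,b,c", "1,2", "w,x,y,z").toDF("Body")

val withArray = df.select(split($"Body", ",") as "arr")

val maxLength = withArray.select(size($"arr") as "size")
  .groupBy().max("size")
  .first().getAs[Int](0) // 4 for this sample data

val result = withArray.select((0 until maxLength).map(i => $"arr".getItem(i) as s"col$i"): _*)

result.show()
// +----+----+----+----+
// |col0|col1|col2|col3|
// +----+----+----+----+
// |   a|   b|   c|null|
// |   1|   2|null|null|
// |   w|   x|   y|   z|
// +----+----+----+----+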