I am trying to create a new DataFrame from a single CSV-formatted column of an existing DataFrame. I don't know the schema in advance, so I am trying to use the spark.createDataFrame method without the schema argument (similar to Method 1 in this example).
I am trying the following code, but it throws an exception:
var csvrdd = df.select(df("Body").cast("string")).rdd.map{x:Row => x.getAs[String](0)}.map(x => x.split(",").toSeq)
var dfWithoutSchema = spark.createDataFrame(csvrdd)
The error:
error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$3: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame <and>
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[Seq[String]])
var dfWithoutSchema = spark.createDataFrame(csvrdd)
Answer (score: 1):
First, look at the signature of the createDataFrame overload in question:

def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame

The type A is bounded to be a subclass of scala.Product. Your RDD contains Seq[String] (an Array[String] before the .toSeq call), which is not such a subclass. If you really want to, you can artificially wrap the array in a Tuple1 (which does extend Product) and make it work:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val csvrdd: RDD[Tuple1[Array[String]]] = df
.select(df("Body").cast("string"))
.rdd
.map{ x:Row => x.getAs[String](0)}
.map(x => Tuple1(x.split(","))) // wrapping with a Tuple1, which extends scala.Product
val dfWithoutSchema = spark.createDataFrame(csvrdd) // now this overload works
dfWithoutSchema.printSchema()
// root
// |-- _1: array (nullable = true)
// | |-- element: string (containsNull = true)
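Equivalently (a minimal sketch, assuming the same df and spark as above; the CsvParts name is just illustrative), any case class also extends scala.Product, so wrapping the array in one satisfies the bound and gives the column a friendlier name than _1:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical wrapper; every case class extends scala.Product,
// so it satisfies the [A <: Product] bound of createDataFrame
case class CsvParts(parts: Array[String])

val caseClassRdd: RDD[CsvParts] = df
  .select(df("Body").cast("string"))
  .rdd
  .map(row => CsvParts(row.getAs[String](0).split(",")))

val dfFromCaseClass = spark.createDataFrame(caseClassRdd)
dfFromCaseClass.printSchema()
// root
//  |-- parts: array (nullable = true)
//  |    |-- element: string (containsNull = true)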
However, this doesn't seem too useful. It creates a DataFrame with a single column of type ArrayType, which can be achieved much more easily with the split function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.split

val withArray = df.select(split(df("Body").cast("string"), ",") as "arr")
withArray.printSchema()
// root
// |-- arr: array (nullable = true)
// | |-- element: string (containsNull = true)
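As a quick usage sketch (the exploded name is just illustrative), each "CSV part" can then be turned into its own row with explode:

import org.apache.spark.sql.functions.explode
import spark.implicits._

val exploded = withArray.select(explode($"arr") as "part")
exploded.printSchema()
// root
//  |-- part: string (nullable = true)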
One more thing: if what you wish to get is a DataFrame with a separate column per "CSV column", you'd have to "decide" on a common schema for all records (not all records necessarily have the same number of "CSV parts"). You can do that by adding another scan over the DataFrame to compute the maximum number of columns needed, and then let Spark "fill in the blanks" with nulls wherever the actual value has fewer parts:
// imports needed below: split, size and the $ column interpolator
import org.apache.spark.sql.functions.{split, size}
import spark.implicits._

// first - split the String into an array of Strings
val withArray = df.select(split(df("Body").cast("string"), ",") as "arr")
// optional - calculate the *maximum* number of columns;
// If you know it to begin with (e.g. "CSV cannot exceed X columns") -
// you can skip this and use that known value
val maxLength: Int = withArray.select(size($"arr") as "size")
.groupBy().max("size")
.first().getAs[Int](0)
// Create the individual columns, with nulls where the arrays were shorter than maxLength
val columns = (0 until maxLength).map(i => $"arr".getItem(i) as s"col$i")
// select these columns
val result = withArray.select(columns: _*)
result.printSchema() // in my example, maxLength = 4
// root
// |-- col0: string (nullable = true)
// |-- col1: string (nullable = true)
// |-- col2: string (nullable = true)
// |-- col3: string (nullable = true)
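And if the maximum width is known up front (as the optional step above notes), that extra scan can be skipped entirely. A small sketch assuming a known bound of 4, with the same withArray and imports as above:

val knownMax = 4 // assumed upper bound on the number of "CSV parts"
val fixedColumns = (0 until knownMax).map(i => $"arr".getItem(i) as s"col$i")
val fixedResult = withArray.select(fixedColumns: _*)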