Question

My RDD has only a single column. Each row is a string holding a list of items separated by |. For example:

  col_1
 a|b|c|d
 q|w|e|r

I want to convert it to a DataFrame, like this:

col_1 | col_2 | col_3 | col_4
 a        b       c        d
 q        w       e        r

The number of columns is not known in advance, and no header is needed (default column names are fine).

I tried:

.map(i => i.split("|")).toDF()

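However, this returns just a single column containing an array of values, rather than actually splitting the values into columns. The end goal is to write the result to a Parquet file.

One workaround would be to write it out as a text file, read that back with Spark as CSV using the given delimiter, and then write it to a Parquet file. But that is a horrible approach; there must be a better way to do this.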

Answer 1

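A DataFrame must have a predefined schema, so you have to supply the number of columns somehow. If different records might have different numbers of delimiters, you'll have to scan the data twice (once to determine the number of columns, then once to convert it to a DataFrame); otherwise, "peeking" at the first record is enough: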

import spark.implicits._

// note the necessary escaping because | is a special character in regular expressions
val arrays = rdd.map(_.split("\\|"))

// if not all values have the same number of delimiters:
val maxCols = arrays.map(_.length).max()

// otherwise - can use first record to determine number of columns
// (use this instead of the line above, not in addition to it):
// val maxCols = arrays.first().length

// now we create a column per index (0 until maxCols) and select these:
val result = arrays.toDF("arr")
  .select((0 until maxCols).map(i => $"arr"(i).as(s"col_$i")): _*)

result.show()
+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|
+-----+-----+-----+-----+
|    a|    b|    c|    d|
|    q|    w|    e|    r|
+-----+-----+-----+-----+
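
If the end goal is the Parquet file mentioned in the question, the steps above can be put together roughly as follows. This is only a sketch under a few assumptions: the sample data, the local SparkSession, and the temporary output path are made up for illustration. It also calls split with a limit of -1 so that trailing empty fields are kept (without the limit, Java's split drops them), and it pads short rows with nulls on the RDD side so that every array has exactly maxCols elements before the columns are selected.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("pipe-split").getOrCreate()
import spark.implicits._

// Hypothetical sample data; the last row is deliberately shorter than the others.
val rdd = spark.sparkContext.parallelize(Seq("a|b|c|d", "q|w|e|r", "x|y"))

// Escape the pipe (regex metacharacter); limit -1 keeps trailing empty fields.
// Cached because the data is scanned twice (once for the width, once for the conversion).
val arrays = rdd.map(_.split("\\|", -1)).cache()

// First scan: the widest row determines the number of columns.
val maxCols = arrays.map(_.length).max()

// Pad shorter rows with nulls so that every index below maxCols exists.
val padded = arrays.map(_.padTo(maxCols, null: String))

// Second scan: one column per array index.
val result = padded.toDF("arr")
  .select((0 until maxCols).map(i => $"arr"(i).as(s"col_$i")): _*)

// Assumption: a throwaway output location; replace with the real target path.
val outPath = Files.createTempDirectory("pipe_split").resolve("result.parquet").toString
result.write.mode("overwrite").parquet(outPath)
```

Padding on the RDD side is a design choice rather than a requirement: it avoids relying on how Spark treats an out-of-range array index, which can depend on the Spark version and on whether ANSI mode is enabled.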

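As an alternative to splitting by hand, the CSV round trip described in the question may not need a temporary text file at all: since Spark 2.2, DataFrameReader.csv also accepts a Dataset[String]. The sketch below assumes a local SparkSession and made-up sample data. Keep in mind that the CSV parser applies its own quoting and escaping rules, so input containing double quotes may be parsed differently than with a plain split, and it is worth checking how rows of differing lengths are handled before relying on it.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("pipe-csv").getOrCreate()
import spark.implicits._

// Hypothetical sample data matching the question.
val rdd = spark.sparkContext.parallelize(Seq("a|b|c|d", "q|w|e|r"))

// Parse the strings directly as pipe-delimited CSV; with no header,
// the columns get the default names _c0, _c1, ...
val parsed = spark.read
  .option("sep", "|")
  .csv(rdd.toDS())

parsed.printSchema()
```

The resulting DataFrame can then be written with parsed.write.parquet(...) just like the one built from arrays.
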
Single-column delimited string RDD to a DataFrame with proper columns
