How to convert an Array[String] of arbitrary length into a single-row DataFrame in Spark

Date: 2019-02-27 21:50:07

Tags: scala apache-spark

I have an Array[String] of arbitrary length, for example:

val strs = Array[String]("id","value","group","ts")

How can I convert it into a DataFrame like the following:

+-----+------+-------+----+
|_0   | _1   | _2    | _3 |
+-----+------+-------+----+
|   id| value| group | ts |
+-----+------+-------+----+

Solutions I have tried:

Code:

spark.sparkContext.parallelize(List((strs.toList))).toDF().show()

spark.sparkContext.parallelize(List(strs)).toDF().show()

Result:

+--------------------+
|               value|
+--------------------+
|[id, value, group...|
+--------------------+

Code:

spark.sparkContext.parallelize(strs).toDF().show()

Result:

+-----+
|value|
+-----+
|   id|
|value|
|group|
|   ts|
+-----+

Not really what I want.

I know that this kind of solution works:

val data1 = List(
  (1,"A","X",1),
  (2,"B","X",2),
  (3,"C",null,3),
  (3,"D","C",3),
  (4,"E","D",3)
).toDF("id","value","group","ts").show()

But in my case the Array[String] has arbitrary length.
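For illustration (assuming import spark.implicits._ is in scope, as in the snippets above), a hypothetical 5-element array would force me to hand-write a Tuple5, and so on for every possible length, which is exactly what I want to avoid:

// Hypothetical 5-element input, purely to illustrate the scaling problem:
val strs5 = Array("id", "value", "group", "ts", "extra")

// The fixed-arity approach requires spelling out the tuple by hand:
List((strs5(0), strs5(1), strs5(2), strs5(3), strs5(4))).toDF().show()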

2 Answers:

Answer 0 (score: 0):

val strs = Array[String]("id","value","group","ts")
val list_of_strs  = List[Array[String]]() :+ strs
spark.sparkContext.parallelize(list_of_strs)
  .map { case Array(s1,s2,s3,s4) => (s1,s2,s3,s4) }
  .toDF().show()

The problem is apparently that we are creating a list with a single element, and that element is itself a collection. I think the fix is to first create an empty list and then append the single element to it.

As per the update, the real issue seems to be that we were not producing tuples, so perhaps this works as well:

val strs = Array[String]("id","value","group","ts")
spark.sparkContext.parallelize(List(strs))
  .map { case Array(s1,s2,s3,s4) => (s1,s2,s3,s4) }
  .toDF().show()

However, I don't think we can handle arrays of arbitrary length this way, because that would require tuples of arbitrary arity... which doesn't make sense, since a DataFrame also has a fixed definition (number and types of columns). If you really need to handle that case, you would have to use the largest tuple and pad the remaining slots with blanks.
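One way to sidestep the fixed-arity tuple problem entirely is to build both the row and the schema dynamically from the array. A minimal sketch (assuming a SparkSession named spark is in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val strs = Array[String]("id","value","group","ts")

// One StringType column per array element, named _0, _1, ... like the desired output.
val schema = StructType(strs.indices.map(i => StructField(s"_$i", StringType)))

// Wrap the whole array in a single Row; with an explicit schema no tuple is needed,
// so the number of columns can be decided at runtime.
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row.fromSeq(strs.toSeq))),
  schema
).show()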

Answer 1 (score: 0):

I think this really comes down to how to convert a List into a Tuple, so I tried the following solution:

    val strs = Array[String]("id","value","group","ts")

    def listToTuple(list:Seq[Object]):Product = {
      val clas = Class.forName("scala.Tuple" + list.size)
      clas.getConstructors.apply(0).newInstance(list:_*).asInstanceOf[Product]
    }

    val aa = listToTuple(strs.toSeq)

I can convert an array or list into a tuple this way, but when I try to turn it into a DataFrame:

List(listToTuple(strs.toSeq)).toDF().show()

I get an exception:

Exception in thread "main" scala.ScalaReflectionException: <none> is not a term
    at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199)
    at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:84)
    at org.apache.spark.sql.catalyst.ScalaReflection$class.constructParams(ScalaReflection.scala:858)
    at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:39)
    at org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:839)
    at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:39)
    at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:606)
    at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
    at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
    at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
    at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
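The static type of listToTuple(strs.toSeq) is the Product trait rather than a concrete TupleN, and Spark derives encoders from the compile-time type, so it cannot find constructor parameters for Product, which is most likely what the ScalaReflectionException above is complaining about. A sketch that avoids tuple encoders altogether, by expanding the single array column from the question's first attempt into one column per element (assuming a SparkSession named spark and import spark.implicits._):

import org.apache.spark.sql.functions.col

val strs = Array[String]("id","value","group","ts")

// One row whose only column holds the whole array...
val raw = spark.sparkContext.parallelize(Seq(strs.toSeq)).toDF("value")

// ...then project each array element into its own column, so the arity is decided at runtime.
// Yields a single row with columns _0, _1, _2, _3 holding id, value, group, ts.
raw.select(strs.indices.map(i => col("value")(i).alias(s"_$i")): _*).show()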