I have an Array[String] of arbitrary length, for example:
val strs = Array[String]("id","value","group","ts")
How can I turn it into a DataFrame that looks like this:
+---+-----+-----+---+
| _0|   _1|   _2| _3|
+---+-----+-----+---+
| id|value|group| ts|
+---+-----+-----+---+
Solutions I have tried:
Code:
spark.sparkContext.parallelize(List((strs.toList))).toDF().show()
or
spark.sparkContext.parallelize(List(strs)).toDF().show()
Result:
+--------------------+
| value|
+--------------------+
|[id, value, group...|
+--------------------+
Code:
spark.sparkContext.parallelize(strs).toDF().show()
Result:
+-----+
|value|
+-----+
| id|
|value|
|group|
| ts|
+-----+
Neither result is what I actually want.
I know one solution is:
val data1 = List(
  (1, "A", "X", 1),
  (2, "B", "X", 2),
  (3, "C", null, 3),
  (3, "D", "C", 3),
  (4, "E", "D", 3)
).toDF("id","value","group","ts").show()
But in my case the Array[String] has arbitrary length.
Answer 0 (score: 0)
val strs = Array[String]("id","value","group","ts")
val list_of_strs = List[Array[String]]() :+ strs
spark.sparkContext.parallelize(list_of_strs)
  .map { case Array(s1, s2, s3, s4) => (s1, s2, s3, s4) }
  .toDF().show()
The problem is evidently that we created a list with a single element where that element is itself a collection. My first guess was that the fix is to create an empty list and then append the single element.
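As a quick check (my own addition, not part of the original answer), both constructions actually produce the same one-element List[Array[String]]:

val strs = Array[String]("id", "value", "group", "ts")

// Both lists contain exactly one element: the array itself.
val viaConstructor = List(strs)                    // List[Array[String]], length 1
val viaAppend      = List[Array[String]]() :+ strs // also List[Array[String]], length 1

assert(viaConstructor.length == 1 && viaAppend.length == 1)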
So the real issue seems to be that we weren't producing a tuple, which means this simpler version may also work:
val strs = Array[String]("id","value","group","ts")
spark.sparkContext.parallelize(List(strs))
  .map { case Array(s1, s2, s3, s4) => (s1, s2, s3, s4) }
  .toDF().show()
But I don't think this can handle arrays of arbitrary length, since that would require tuples of arbitrary arity, and that makes no sense: a DataFrame likewise has a fixed definition (a fixed number of columns with fixed types). If you really had to support this, you would have to pick the largest tuple you need and pad the remaining slots with blanks.
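A minimal sketch of that padding idea, assuming a known upper bound of four columns and empty strings as the blank filler (padTo is standard Scala; the bound itself is my assumption):

val maxCols = 4  // assumed upper bound on the number of columns

// Pad shorter arrays with empty strings so every row matches the largest tuple.
val padded = strs.padTo(maxCols, "")

spark.sparkContext.parallelize(List(padded))
  .map { case Array(s1, s2, s3, s4) => (s1, s2, s3, s4) }
  .toDF().show()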
Answer 1 (score: 0)
I think this comes down to how to transfer a List to a Tuple, so I tried the following solution:
val strs = Array[String]("id","value","group","ts")

def listToTuple(list: Seq[Object]): Product = {
  val clas = Class.forName("scala.Tuple" + list.size)
  clas.getConstructors.apply(0).newInstance(list: _*).asInstanceOf[Product]
}

val aa = listToTuple(strs.toSeq)
This converts an array or list to a tuple, but when I try to turn it into a DataFrame:
List(listToTuple(strs.toSeq)).toDF().show()
I get an exception:
Exception in thread "main" scala.ScalaReflectionException: <none> is not a term
at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199)
at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:84)
at org.apache.spark.sql.catalyst.ScalaReflection$class.constructParams(ScalaReflection.scala:858)
at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:839)
at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:606)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
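The exception is consistent with how toDF works: the implicit newProductEncoder needs a concrete tuple type at compile time, while listToTuple only exposes the static type Product, so Spark's reflection has nothing to inspect. A sketch that sidesteps encoders entirely by pairing a Row with an explicit StructType (both part of the public Spark API; the _0-style column names mirror the desired output above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val strs = Array[String]("id", "value", "group", "ts")

// One StringType column per array element, named _0, _1, ... like the desired output.
val schema = StructType(strs.indices.map(i => StructField(s"_$i", StringType, nullable = true)))

// A single Row holding all the values, whatever the array's length.
val rows = spark.sparkContext.parallelize(Seq(Row.fromSeq(strs.toSeq)))

spark.createDataFrame(rows, schema).show()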