Question

I have a Spark RDD (or DataFrame; converting between the two is not a problem) with the following structure (one example of each):

res248: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[1004] at map at <console>:246
org.apache.spark.sql.DataFrame = [id: string, list: array<string>]

I want to extend this RDD/DF with an additional column that holds the size of the list array. The output should look like this (example):

org.apache.spark.sql.DataFrame = [id: string, list: array<string>, length_of_list: int]

I tried rdd.map(x => (x._1, x._2, count(x._2))), but got this error message:

<console>:246: error: overloaded method value count with alternatives:
  (columnName: String)org.apache.spark.sql.TypedColumn[Any,Long] <and>
  (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column

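I also tried adding the new column on the DataFrame with withColumn("new_column", count($"list")) and several variations of it. That does not work either: I get an error message complaining about aggregation.

Do you know how to achieve this without collecting the RDD?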

Answer 1

You can create the new column with a simple UDF applied to the list column, as follows:

val df = Seq(
  ("a", Array("x1", "x2", "x3")),
  ("b", Array("y1", "y2", "y3", "y4"))
).toDF(
  "id", "list"
)
// df: org.apache.spark.sql.DataFrame = [id: string, list: array<string>]

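import org.apache.spark.sql.functions.udf

// Spark passes an array<string> column to a Scala UDF as Seq[String]
val listSize = (l: Seq[String]) => l.size
// listSize: Seq[String] => Int = <function1>

// val rather than def, so the UDF is built once instead of on every reference
val listSizeUDF = udf(listSize)
// listSizeUDF: org.apache.spark.sql.expressions.UserDefinedFunction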

val df2 = df.withColumn("length_of_list", listSizeUDF($"list"))

df2.show
+---+----------------+--------------+
| id|            list|length_of_list|
+---+----------------+--------------+
|  a|    [x1, x2, x3]|             3|
|  b|[y1, y2, y3, y4]|             4|
+---+----------------+--------------+

[UPDATE]

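As @Ramesh Maharjan pointed out, Spark has a built-in size function, which I had forgotten about. I will leave the original answer in place as a simple example of using a UDF.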

Answer 2

There is a built-in function, size, which returns the length of an array or map.

import org.apache.spark.sql.functions._
df.withColumn("length_of_list", size($"list"))
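
For the RDD side of the question, a rough sketch may help. Inside rdd.map each element is an ordinary Scala tuple, so the array's own length method applies, whereas count from org.apache.spark.sql.functions works on Columns as an aggregate, which is likely why both attempts in the question failed. The session setup, the sample data, and the names withLength and dfFromRdd are assumptions made for illustration only:

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session in spark-shell; otherwise starts a local one (assumed setup).
val spark = SparkSession.builder().master("local[*]").appName("list-length").getOrCreate()

// Sample data mirroring the RDD[(String, Array[String])] from the question.
val rdd = spark.sparkContext.parallelize(Seq(
  ("a", Array("x1", "x2", "x3")),
  ("b", Array("y1", "y2", "y3", "y4"))
))

// Each element is a plain tuple here, so Array.length is enough;
// no Column functions and no collect are involved.
val withLength = rdd.map { case (id, list) => (id, list, list.length) }

// Optional: convert to a DataFrame with the column names used in the question.
val dfFromRdd = spark.createDataFrame(withLength).toDF("id", "list", "length_of_list")
```

The map runs on the executors like any other transformation, so nothing is brought back to the driver until an action such as show is called.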
