Question

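I am currently trying to learn how to use Apache Spark with Scala.

I have the following table as a DataFrame that I want to analyse.

Now I want to iterate over the rows, get the ID and the word count of the string in the body column for each row, and output that information in a DataFrame with two columns.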

def analyseDF(df: DataFrame): Unit = {
  // var retFrame = spark.emptyDataset[ClassIdCount].toDF()
  var tList = mutable.MutableList[IdCount]()

  df.foreach(row => {
    val wordCnt = row.getString(5).split(" ").size
    val mailid = row.getString(0)

    val record = IdCount(mailid.toString(), wordCnt.toInt)
    tList += record

    println(tList)
    println(record)
  })
  tList.toDF().show()
  // tList.toDS().show()
}

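For some reason, when I call tList.toDF().show(), the two-column frame is always empty, even though the records are generated correctly inside the loop. Can someone give me a hint here?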

Answer 1

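Typical beginner mistake: tList exists only on the driver and cannot be updated from code that runs on the executors. The closure passed to foreach is serialized and shipped to the executors, so each task appends to its own copy of the list and the driver's list stays empty. That is also not how you create a DataFrame from an existing DataFrame. Use transformations/aggregations instead.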
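If you want to stay close to the row-by-row logic of the question, one option is to express it as a map transformation instead of a foreach with side effects. The following is only a minimal sketch under some assumptions: the definition of IdCount is guessed from how the question uses it, the column positions 0 and 5 are taken from the question, and the object name, the helper countWords and the sample column names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Assumed shape of the IdCount class from the question (its definition is not shown there).
case class IdCount(id: String, count: Int)

object MapRowsSketch {
  // Same counting logic as in the question: number of tokens after splitting on a single space.
  def countWords(body: String): Int = body.split(" ").length

  // Returns the result as a Dataset built by a transformation,
  // instead of mutating a driver-side list from inside foreach.
  def analyseDF(df: DataFrame): Dataset[IdCount] = {
    import df.sparkSession.implicits._ // provides the Encoder for IdCount
    df.map(row => IdCount(row.getString(0), countWords(row.getString(5))))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-rows-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: six string columns so that index 5 is the body.
    val sample = Seq(
      ("1", "a", "b", "c", "d", "hello spark world")
    ).toDF("id", "c1", "c2", "c3", "c4", "body")

    analyseDF(sample).show()
    spark.stop()
  }
}
```

The function returns the new Dataset rather than Unit, so the caller decides whether to show it, write it out, or keep transforming it. Note that the sketch does not handle null values in either column.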

In this case you can use the built-in DataFrame API functions split and size:

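import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"col" syntax; assumes a SparkSession named spark

val transformedDf = df
  .select(
    $"id",
    // split the body on single spaces and count the resulting tokens
    size(split($"body", " ")).as("cnt")
  )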
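To try the built-in approach end to end, here is a small self-contained sketch. The object name, the helper withWordCount and the sample rows are hypothetical. It also swaps the single-space pattern for the regex \s+ combined with trim, which is my own variation and not part of the answer above: splitting "two  words" on a single space yields an empty token between the two spaces, while \s+ treats the run of whitespace as one separator.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, size, split, trim}

object BuiltinWordCountSketch {
  // Counts whitespace-separated tokens per row using only built-in column functions.
  // Assumes the input has columns named "id" and "body", as in the answer.
  def withWordCount(df: DataFrame): DataFrame =
    df.select(
      col("id"),
      size(split(trim(col("body")), "\\s+")).as("cnt")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("builtin-word-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; the second one contains a double space.
    val mails = Seq(
      ("1", "hello spark world"),
      ("2", "two  words")
    ).toDF("id", "body")

    withWordCount(mails).show()
    spark.stop()
  }
}
```

Because everything stays inside the DataFrame API, Spark can optimise the query and no per-row Scala closure has to be serialized.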

Apache Spark: iterating over the rows of a DataFrame and creating a new DataFrame via MutableList (Scala)
