Question

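I'm trying to use the collect function to get a DataFrame as a list of records, and it is very slow for a DataFrame with 4000+ columns. Is there a faster alternative? I even tried calling df.persist() before .collect(), but that still doesn't help.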

val data = df
  .collect()
  .map(
    x =>
      x.toSeq.toList.map {
        case null  => ""
        case other => other.toString
      }
  )
  .toList

Edit (based on comments):

So the use case is to fetch records from the DataFrame and display them as sample data.

Answer 1

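Based on your question and comments, it sounds like you are looking for a way to sample both columns and rows. Here is a simple way to take N random columns and randomly sample a fraction of the rows in a DataFrame: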

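import org.apache.spark.sql.functions.col
import scala.util.Random
import spark.implicits._  // for toDF; assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  (1, "a", 10.0, 100L),
  (2, "b", 20.0, 200L),
  (3, "c", 30.0, 300L)
).toDF("c1", "c2", "c3", "c4")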

// e.g. Take 3 random columns and randomly pick ~70% of rows
df.
  select(Random.shuffle(df.columns.toSeq).take(3).map(col): _*).
  sample(70.0/100).
  show
// +---+---+---+
// | c1| c2| c4|
// +---+---+---+
// |  1|  a|100|
// |  3|  c|300|
// +---+---+---+
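
If the sample needs to be repeatable between runs, both random steps can be seeded. The following is only a sketch under some assumptions: the local SparkSession, the seed value 42 and the 20-row cap are illustrative choices, and a seeded sample should only be expected to repeat as long as the input data and its partitioning stay the same.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import scala.util.Random

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sample-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "a", 10.0, 100L),
  (2, "b", 20.0, 200L),
  (3, "c", 30.0, 300L)
).toDF("c1", "c2", "c3", "c4")

// A seeded generator picks the same 3 columns on every run.
val rng = new Random(42L)
val pickedColumns = rng.shuffle(df.columns.toSeq).take(3)

val sampled = df
  .select(pickedColumns.map(col): _*)
  // Bernoulli sampling of roughly 70% of rows, with a fixed seed.
  .sample(withReplacement = false, fraction = 0.7, seed = 42L)
  // Cap what is eventually brought to the driver for display.
  .limit(20)

sampled.show()
```

Selecting the columns before sampling keeps the plan narrow, which is likely to matter most when the source has thousands of columns.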

Answer 2

You should limit the number of rows fetched to the driver; collect retrieves everything.

You can use

df.limit(20).collect

or

df.take(20)
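
As a rough illustration of the two calls above, the sketch below runs both on a stand-in DataFrame. The local SparkSession and the 1000-row `id` DataFrame are assumptions made only for this example; each call returns an Array[Row] holding at most 20 rows.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("take-sketch").getOrCreate()

// Stand-in for a large DataFrame: 1000 rows with a single `id` column.
val df = spark.range(1000).toDF("id")

// Both variants bring at most 20 rows back to the driver.
val viaLimit: Array[Row] = df.limit(20).collect()
val viaTake: Array[Row]  = df.take(20)

println(s"limit+collect: ${viaLimit.length} rows, take: ${viaTake.length} rows")
```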

Also, if you map each Row to a List[String] before collecting, it should be faster. That way, the conversion runs on the executors:

// Needs `import spark.implicits._` in scope to provide the List[String] encoder
val data = df
  .map(
    x =>
      x.toSeq.toList.map {
        case null  => ""
        case other => other.toString
      }
  )
  .take(20)
  .toList
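
A possible variation, not from the answer itself, is to do the string conversion with column expressions instead of a typed map: cast every column to string and replace nulls with an empty string before taking the rows. This is a sketch under assumptions: the local SparkSession and the two-column DataFrame are made up for illustration, and `cast("string")` may format some types (timestamps, binary, nested types) differently from `toString`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("stringify-sketch").getOrCreate()
import spark.implicits._

// Small illustrative DataFrame containing nulls in both columns.
val df = Seq(
  (Some(1), "a"),
  (None, "b"),
  (Some(3), null)
).toDF("c1", "c2")

// Cast each column to string on the executors and turn nulls into "".
val asStrings = df.select(
  df.columns.toSeq.map(name => coalesce(df(name).cast("string"), lit("")).as(name)): _*
)

// Only the (at most) 20 requested rows reach the driver; every cell is a non-null string.
val data: List[List[String]] = asStrings
  .take(20)
  .map(row => row.toSeq.map(_.toString).toList)
  .toList
```

Because the result stays a plain DataFrame until `take`, no encoder for List[String] is needed in this variant.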
