Question

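For some reason I have to convert an RDD to a DataFrame and then perform some operations on the DataFrame.

My interface is RDD, so I have to convert the DataFrame back to an RDD. When I use df.withColumn, the number of partitions changes to 1, so I have to sortBy and repartition the RDD.

Is there a cleaner solution?

This is my code: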

Answer 1

Let's keep this as simple as possible. We'll generate the same data spread across 4 partitions:

scala> val df = spark.range(1,9,1,4).toDF
df: org.apache.spark.sql.DataFrame = [id: bigint]

scala> df.show
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
+---+

scala> df.rdd.getNumPartitions
res13: Int = 4

We don't need 3 window functions to demonstrate this, so let's use just one:

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val df2 = df.withColumn("csum", sum($"id").over(Window.orderBy($"id")))
df2: org.apache.spark.sql.DataFrame = [id: bigint, csum: bigint]

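What happens here is that we are not merely adding a column: we are computing a cumulative sum over a window. Because no partitioning column was provided, the window function moves all the data to a single partition, and Spark even warns you about it: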

scala> df2.rdd.getNumPartitions
17/06/06 10:05:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
res14: Int = 1

scala> df2.show
17/06/06 10:05:56 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---+----+
| id|csum|
+---+----+
|  1|   1|
|  2|   3|
|  3|   6|
|  4|  10|
|  5|  15|
|  6|  21|
|  7|  28|
|  8|  36|
+---+----+

So now let's add a column to partition by. We'll create a new DataFrame purely for demonstration:

scala> val df3 = df.withColumn("x", when($"id"<5,lit("a")).otherwise("b"))
df3: org.apache.spark.sql.DataFrame = [id: bigint, x: string]

df3 still has the same number of partitions that we explicitly defined on df:

scala> df3.rdd.getNumPartitions
res18: Int = 4

Now let's run the window operation partitioned by column x:

scala> val df4 = df3.withColumn("csum", sum($"id").over(Window.orderBy($"id").partitionBy($"x")))
df4: org.apache.spark.sql.DataFrame = [id: bigint, x: string ... 1 more field]

scala> df4.show
+---+---+----+
| id|  x|csum|
+---+---+----+
|  5|  b|   5|
|  6|  b|  11|
|  7|  b|  18|
|  8|  b|  26|
|  1|  a|   1|
|  2|  a|   3|
|  3|  a|   6|
|  4|  a|  10|
+---+---+----+
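As a rough illustration of what the two window definitions compute, here is a plain-Python model (not Spark code; running_sums is a name made up for this sketch). Without a partition key, every row's sum depends on all earlier rows, which is why Spark has to bring them together in one partition. With a key, each group can be summed independently.

```python
from itertools import accumulate, groupby

# Same data as df3: ids 1..8, x = "a" when id < 5, otherwise "b".
rows = [(i, "a" if i < 5 else "b") for i in range(1, 9)]

def running_sums(rows, key=None):
    """Cumulative sum of id, ordered by id, optionally restarted per key.

    Hypothetical plain-Python stand-in for
    sum(id).over(Window.orderBy(id)) with or without partitionBy(x).
    """
    if key is None:
        # One global ordering: each row needs every earlier row.
        ordered = sorted(rows)
        return list(zip(ordered, accumulate(r[0] for r in ordered)))
    result = []
    # groupby requires input already sorted by the grouping key.
    by_key_then_id = sorted(rows, key=lambda r: (key(r), r[0]))
    for _, group in groupby(by_key_then_id, key=key):
        group = list(group)
        result.extend(zip(group, accumulate(r[0] for r in group)))
    return result

global_csum = [c for _, c in running_sums(rows)]

per_key_csum = {}
for (_, x), c in running_sums(rows, key=lambda r: r[1]):
    per_key_csum.setdefault(x, []).append(c)

print(global_csum)
print(per_key_csum)
```

The two results correspond to the csum columns of df2 and df4 shown above.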

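The window function repartitions the data using the default number of shuffle partitions set in the Spark configuration (spark.sql.shuffle.partitions, which is 200 by default):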

scala> df4.rdd.getNumPartitions
res20: Int = 200
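One consequence worth noting: with only two distinct values of x, almost all of those 200 partitions hold no rows. The sketch below models this in plain Python. It is not Spark code: zlib.crc32 is only a deterministic stand-in for Spark's own hash function (so real partition indices will differ), and assign_partitions is a hypothetical helper.

```python
import zlib
from collections import defaultdict

NUM_SHUFFLE_PARTITIONS = 200  # default value of spark.sql.shuffle.partitions

def assign_partitions(rows, key, num_partitions):
    """Place each row in partition hash(key) mod num_partitions.

    Assumption: crc32 stands in for Spark's real hash, so only the
    overall shape (few non-empty partitions) carries over.
    """
    partitions = defaultdict(list)
    for row in rows:
        index = zlib.crc32(str(key(row)).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions

# Same data as df3: ids 1..8, x = "a" when id < 5, otherwise "b".
rows = [(i, "a" if i < 5 else "b") for i in range(1, 9)]
partitions = assign_partitions(
    rows, key=lambda r: r[1], num_partitions=NUM_SHUFFLE_PARTITIONS
)

print(f"{len(partitions)} of {NUM_SHUFFLE_PARTITIONS} partitions contain rows")
```

All rows sharing a key land in the same partition, so the number of non-empty partitions can never exceed the number of distinct keys.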

Answer 2

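I just read about controlling the number of partitions for groupBy aggregations at https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-performance-tuning-groupBy-aggregation.html, and the same approach seems to work for Window as well. In my code I define a window like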

from pyspark.sql import Window
from pyspark.sql import functions as F

windowSpec = Window \
    .partitionBy('colA', 'colB') \
    .orderBy('timeCol') \
    .rowsBetween(1, 1)

then do

next_event = F.lead('timeCol', 1).over(windowSpec)

and create the DataFrame with

df2 = df.withColumn('next_event', next_event)

and indeed it has 200 partitions. However, if I do

df2 = df.repartition(10, 'colA', 'colB').withColumn('next_event', next_event)

it has 10!

How can I use DataFrame withColumn without changing the partitioning?

2 answers: