When I use the code below, I have an RDD whose number of partitions changes to 1. Is it possible to keep the partitions unchanged when I use window? Here is my code:

// Azure container filesystem; it contains the source, destination, archive, and result files
// Hadoop FileSystem API (needed for FileSystem, Path and PathFilter below)
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}
val azureContainerFs = FileSystem.get(sc.hadoopConfiguration)
// Read source file list
val sourceFiles = azureContainerFs.listStatus(new Path("/"+sourcePath +"/"),new PathFilter {
override def accept(path: Path): Boolean = {
val name = path.getName
name.endsWith(".json")
}
}).toList.par
// Ingestion processing to each file
for (sourceFile <- sourceFiles) {
// Tokenize file name from path
val sourceFileName = sourceFile.getPath.getName
// Create a customer invoice DF from source json
val customerInvoiceDf = sqlContext.read.format("json").schema(schemaDf.schema).json("/"+sourcePath +"/"+sourceFileName).cache()
// ... rest of the per-file processing (not shown in the question)
}
My input has 4 partitions and I want my result to have 4 partitions as well. Is there a cleaner solution?
Answer 0 (score: 0)
The RDD ending up with 1 partition is expected, because you are executing a Window function on the DataFrame without a partitionBy clause. In that case all of the data has to be grouped into a single partition.
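To make this concrete, here is a minimal sketch in the same spark-shell spirit as the example below (the data and column names are made up for illustration): a window that only has an orderBy clause collapses the result into a single partition.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

// Sketch only: 8 made-up values spread over 4 input partitions.
val input = spark.sparkContext.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
input.getNumPartitions  // 4

// A window with orderBy but no partitionBy: Spark has to move every row
// into one partition to compute the running sum.
val noPartitionBy = input.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
  .rdd
noPartitionBy.getNumPartitions  // 1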
When we include a partitionBy clause in the Window function, the number of partitions in the resulting RDD is no longer 1, as shown below. In the example below, an additional column named col1 is added to the original DataFrame, and the same window function is applied with a partitionBy clause on the col1 column.
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val rdd = spark.sparkContext.parallelize(List((1,1),(3,1),(2,2),(4,2),(5,2),(6,3),(7,3),(8,3)),4)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[49] at parallelize at <console>:28
scala> val result = rdd.toDF("values", "col1").withColumn("csum", sum(col("values")).over(Window.partitionBy("col1").orderBy("values"))).rdd
result: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[58] at rdd at <console>:30
scala> result.getNumPartitions
res6: Int = 200
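The 200 partitions above come from the default value of spark.sql.shuffle.partitions, because the partitionBy clause introduces a shuffle. If the goal is to end up with exactly 4 partitions, as asked in the question, a hedged sketch (reusing the result RDD from the example above) is to lower that setting before the window runs, or to repartition afterwards.

// Sketch: two ways to end up with 4 partitions in the result.

// Option 1: lower the shuffle partition count before applying the window.
spark.conf.set("spark.sql.shuffle.partitions", "4")

// Option 2: repartition (or coalesce) the resulting RDD afterwards.
val result4 = result.repartition(4)
result4.getNumPartitions  // 4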