When I use a window with Spark/Scala, can I avoid changing the partitions?

Time: 2017-06-07 04:39:32

Tags: scala apache-spark apache-spark-sql

When I run the code below, my RDD's partition count changes to 1. Can I keep the partitioning unchanged when I use a window? Here is my code:

// Azure container filesystem; it contains source, destination, archive and result files
val azureContainerFs = FileSystem.get(sc.hadoopConfiguration)

// Read source file list
val sourceFiles = azureContainerFs.listStatus(new Path("/" + sourcePath + "/"), new PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName
    name.endsWith(".json")
  }
}).toList.par

// Ingestion processing for each file
for (sourceFile <- sourceFiles) {
  // Tokenize file name from path
  val sourceFileName = sourceFile.getPath.toString.substring(sourceFile.getPath.toString.lastIndexOf('/') + 1)

  // Create a customer invoice DF from the source json
  val customerInvoiceDf = sqlContext.read.format("json").schema(schemaDf.schema).json("/" + sourcePath + "/" + sourceFileName).cache()
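The snippet above stops before the window call itself. A minimal sketch of the kind of call that produces the behaviour described here, assuming a hypothetical running total over an InvoiceAmount column and no partitionBy clause:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// customerInvoiceDf comes from the loop above; "InvoiceAmount" is a hypothetical column name
customerInvoiceDf.rdd.getNumPartitions   // the input partitioning (4 in this question)
val withRunningTotal = customerInvoiceDf.withColumn(
  "runningTotal", sum(col("InvoiceAmount")).over(Window.orderBy("InvoiceAmount")))
withRunningTotal.rdd.getNumPartitions    // 1 after a window with no partitionBy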


My input has 4 partitions and I want my result to have 4 partitions as well. Is there a cleaner solution?

1 Answer:

Answer 0 (score: 0)

Getting 1 partition for the RDD is expected here, because you are executing a Window function on the DataFrame without a partitionBy clause. In that case all of the data has to be moved into a single partition.
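As a sketch of that case (assuming a spark-shell session, as in the example further below), the same aggregation without a partitionBy clause ends up in a single partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val rdd = spark.sparkContext.parallelize(List((1,1),(3,1),(2,2),(4,2),(5,2),(6,3),(7,3),(8,3)), 4)
val noPartitionBy = rdd.toDF("values", "col1")
  .withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
noPartitionBy.explain()                 // the plan shows an exchange to a single partition before the Window
noPartitionBy.rdd.getNumPartitions      // 1, regardless of the 4 input partitions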

When we include a partitionBy clause in the Window function, the number of partitions in the resulting RDD is no longer 1, as shown below. In the example below we add another column named col1 to the original DataFrame and apply the same window function with a partitionBy clause on the col1 column.

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val rdd = spark.sparkContext.parallelize(List((1,1),(3,1),(2,2),(4,2),(5,2),(6,3),(7,3),(8,3)),4)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[49] at parallelize at <console>:28

scala> val result = rdd.toDF("values", "col1").withColumn("csum", sum(col("values")).over(Window.partitionBy("col1").orderBy("values"))).rdd
result: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[58] at rdd at <console>:30

scala> result.getNumPartitions
res6: Int = 200
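The 200 here is simply the default value of spark.sql.shuffle.partitions, which controls the shuffle introduced by partitionBy; it is not derived from the 4 input partitions. If the result should end up with 4 partitions again, two possible adjustments (a sketch, not part of the original answer) are lowering that setting or repartitioning the result explicitly:

// Option 1: lower the shuffle parallelism before running the window
spark.conf.set("spark.sql.shuffle.partitions", "4")

// Option 2: repartition the windowed result explicitly
val result4 = rdd.toDF("values", "col1")
  .withColumn("csum", sum(col("values")).over(Window.partitionBy("col1").orderBy("values")))
  .repartition(4)
result4.rdd.getNumPartitions   // 4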