When I use the code below, I have an RDD whose number of partitions changes to 1. Is it possible to keep the partitions unchanged when I use window? Here is my code:

// Azure container filesystem; it contains the source, destination, archive, and result files
// Hadoop FileSystem API (needed for FileSystem, Path and PathFilter below)
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}
val azureContainerFs = FileSystem.get(sc.hadoopConfiguration)
// Read source file list
val sourceFiles = azureContainerFs.listStatus(new Path("/"+sourcePath +"/"),new PathFilter {
override def accept(path: Path): Boolean = {
val name = path.getName
name.endsWith(".json")
}
}).toList.par
// Ingestion processing to each file
for (sourceFile <- sourceFiles) {
// Tokenize file name from path
val sourceFileName = sourceFile.getPath.getName
// Create a customer invoice DF from source json
val customerInvoiceDf = sqlContext.read.format("json").schema(schemaDf.schema).json("/"+sourcePath +"/"+sourceFileName).cache()
// ... rest of the per-file processing (not shown in the question)
}
My input has 4 partitions and I want my result to have 4 partitions as well. Is there a cleaner solution?
Answer 0 (score: 0)
The RDD ending up with 1 partition is expected, because you are executing a Window function on the DataFrame without a partitionBy clause. In that case all of the data has to be grouped into a single partition.
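To make this concrete, here is a minimal sketch in the same spark-shell spirit as the example below (the data and column names are made up for illustration): a window that only has an orderBy clause collapses the result into a single partition.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

// Sketch only: 8 made-up values spread over 4 input partitions.
val input = spark.sparkContext.parallelize(List(1, 3, 2, 4, 5, 6, 7, 8), 4)
input.getNumPartitions  // 4

// A window with orderBy but no partitionBy: Spark has to move every row
// into one partition to compute the running sum.
val noPartitionBy = input.toDF("values")
  .withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
  .rdd
noPartitionBy.getNumPartitions  // 1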
When we include a partitionBy clause in the Window function, the number of partitions in the resulting RDD is no longer 1, as shown below. In the example below, an additional column named col1 is added to the original DataFrame, and the same window function is applied with a partitionBy clause on the col1 column.
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val rdd = spark.sparkContext.parallelize(List((1,1),(3,1),(2,2),(4,2),(5,2),(6,3),(7,3),(8,3)),4)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[49] at parallelize at <console>:28
scala> val result = rdd.toDF("values", "col1").withColumn("csum", sum(col("values")).over(Window.partitionBy("col1").orderBy("values"))).rdd
result: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[58] at rdd at <console>:30
scala> result.getNumPartitions
res6: Int = 200
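The 200 partitions above come from the default value of spark.sql.shuffle.partitions, because the partitionBy clause introduces a shuffle. If the goal is to end up with exactly 4 partitions, as asked in the question, a hedged sketch (reusing the result RDD from the example above) is to lower that setting before the window runs, or to repartition afterwards.

// Sketch: two ways to end up with 4 partitions in the result.

// Option 1: lower the shuffle partition count before applying the window.
spark.conf.set("spark.sql.shuffle.partitions", "4")

// Option 2: repartition (or coalesce) the resulting RDD afterwards.
val result4 = result.repartition(4)
result4.getNumPartitions  // 4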