Question

I'm relatively new to Scala and Spark and would appreciate some insight.

I receive data in Parquet format. It comes from multiple sources and is laid out as follows:

/dataroot/businessdate=20170829/sourcesystem=storea/datasample.parquet:

| SourceSystem | BusinessDate | OrderDate  | (other columns)|
| StoreA       | 2017-08-29   | 2018-02-03 | ...            |
...

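This data is partitioned by BusinessDate and then by SourceSystem. I need to repartition it so that it is partitioned by OrderDate. The Spark job I'm writing must be runnable for a specified business date, and when it generates output it must not create duplicate data or delete data belonging to other business dates.

My current solution stalls while iterating over the OrderDates in my input. This is what I have: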

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{lit, not}

val spark = SparkSession.builder.appName("Placeholder").master("local[*]").getOrCreate
import spark.implicits._ // enables the $"column" syntax used below
val businessdate = args(0)
val inputPath = dataroot+"\\businessdate="+businessdate

val inputData = spark.read.parquet(inputPath)
val distinctOrderDates = inputData.select(inputData("OrderDate")).distinct
for(od <- distinctOrderDates){
    val outputPath = outputRoot+"orderdate="+od
    //There's more code here to verify existence of output data, but this works for local development
    val unionDF = if(Files.exists(Paths.get(outputPath))) {
        //If output data exists, program stalls here
        val outputData = spark.read.parquet(outputPath)
        val filteredOutput = outputData.filter(not($"BusinessDate"===businessdate))
        val filteredInput = inputData.filter($"OrderDate"=== lit(od))
        filteredOutput.union(filteredInput) //assigned to unionDF
    }else{
        inputData.filter($"OrderDate"=== lit(od))
    }
    // Process stalls here
    unionDF.write.mode(SaveMode.Append).parquet(tempRoot)
}

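The files are then copied from the temporary directory into the correct output directory.

I don't get any error messages or unusual INFO messages. When I run the process, the temporary output directory is created but never populated (if no output data exists yet). Otherwise, according to the Spark UI, where the program stalls depends on whether existing output data is present.

In addition, if I add a line that prints unionDF.count before the write operation, the program stalls there instead of during the write. This tells me the problem lies in evaluating unionDF, but I'm not sure how to fix it.

This works fine when the input data contains only one OrderDate.

Ideally, I would like to run a simple partitionBy("OrderDate") on the input data, but that doesn't solve the problem of avoiding duplicate data or overwriting existing data. Is there a better way to approach this problem, or a fix for the issue I'm running into?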

Answer 1

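I solved this by converting distinctOrderDates to a list. Iterating over the DataFrame the way I did caused the problem, even though I was extracting the DataFrame values inside the for loop and operating on them. The code I added is: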

val distinctDateList = distinctOrderDates.collect.map(_(0)).toList

I use that list instead of the distinct-dates DataFrame in the rest of the program.
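A likely reason this helps: a Scala `for` loop over a DataFrame desugars to `Dataset.foreach`, whose body runs on the executors, and other DataFrames (such as `inputData`) cannot be used from inside that closure. Collecting the distinct values first turns them into an ordinary Scala list, so the loop runs on the driver. The following is only a minimal sketch of that pattern, not the original job: the object name `CollectThenLoop`, the helper `distinctOrderDates`, and the in-memory sample rows are assumptions made for illustration, and it assumes OrderDate is stored as a string.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical example object; names are illustrative only.
object CollectThenLoop {

  // Brings the distinct OrderDate values to the driver as a plain Scala list.
  // Assumes the OrderDate column holds strings.
  def distinctOrderDates(df: DataFrame): List[String] =
    df.select("OrderDate").distinct().collect().map(_.getString(0)).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("CollectThenLoop")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the Parquet input described in the question.
    val inputData = Seq(
      ("StoreA", "2017-08-29", "2018-02-03"),
      ("StoreA", "2017-08-29", "2018-02-04"),
      ("StoreB", "2017-08-29", "2018-02-03")
    ).toDF("SourceSystem", "BusinessDate", "OrderDate")

    // This loop iterates over a local List, so it executes on the driver
    // and may freely build and run further DataFrame operations.
    for (od <- distinctOrderDates(inputData)) {
      val rowsForDate = inputData.filter(col("OrderDate") === od)
      println(s"orderdate=$od has ${rowsForDate.count()} row(s)")
    }

    spark.stop()
  }
}
```

Collecting is reasonable here because the number of distinct order dates is expected to be small; for a very large number of distinct values the list itself could become a driver-side bottleneck.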
