except() method issue

Date: 2018-03-09 07:14:42

Tags: scala apache-spark apache-spark-sql spark-dataframe

I have a use case where I need to subtract one DataFrame from another, so I used the DataFrame except() method.

This works fine locally on smaller datasets.

But when I run it against data in an AWS S3 bucket, except() does not subtract as expected. Is there anything I need to watch out for in a distributed environment?

Has anyone faced a similar issue?

Here is my sample code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val values = List(List("One", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "A", "Yes") 
  , List("Two", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "X", "No") 
  , List("Three", "2017-07-09T23:59:59.000", "2017-12-05T23:59:58.000", "M", "Yes") 
  , List("Four", "2017-11-01T23:59:59.000", "2017-12-09T23:59:58.000", "A", "No") 
  , List("Five", "2017-07-09T23:59:59.000", "2017-12-05T23:59:58.000", "", "No") 
  ,List("One", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "", "No")
)
  .map(row => (row(0), row(1), row(2), row(3), row(4)))

val spark = SparkSession.builder().master("local").getOrCreate()

import spark.implicits._

val df = values.toDF("KEY", "ROW_START_DATE", "ROW_END_DATE", "CODE", "Indicator")

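// Keep rows whose validity window spans 2017-10-31 and whose CODE is one of the listed values.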
val filterCond = (col("ROW_START_DATE") <= "2017-10-31T23:59:59.999" && col("ROW_END_DATE") >= "2017-10-31T23:59:59.999" && col("CODE").isin("M", "A", "R", "G"))


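// Split the DataFrame: rows matching the filter, and the complement via except().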
val Filtered = df.filter(filterCond)
val Excluded = df.except(df.filter(filterCond))

Expected output:

df.show(false)
Filtered.show(false)
Excluded.show(false)
+-----+-----------------------+-----------------------+----+---------+
|KEY  |ROW_START_DATE         |ROW_END_DATE           |CODE|Indicator|
+-----+-----------------------+-----------------------+----+---------+
|One  |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A   |Yes      |
|Two  |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|X   |No       |
|Three|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|M   |Yes      |
|Four |2017-11-01T23:59:59.000|2017-12-09T23:59:58.000|A   |No       |
|Five |2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|    |No       |
|One  |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|    |No       |
+-----+-----------------------+-----------------------+----+---------+
+-----+-----------------------+-----------------------+----+---------+
|KEY  |ROW_START_DATE         |ROW_END_DATE           |CODE|Indicator|
+-----+-----------------------+-----------------------+----+---------+
|One  |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A   |Yes      |
|Three|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|M   |Yes      |
+-----+-----------------------+-----------------------+----+---------+
+----+-----------------------+-----------------------+----+---------+
|KEY |ROW_START_DATE         |ROW_END_DATE           |CODE|Indicator|
+----+-----------------------+-----------------------+----+---------+
|Four|2017-11-01T23:59:59.000|2017-12-09T23:59:58.000|A   |No       |
|Two |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|X   |No       |
|Five|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|    |No       |
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|    |No       |
+----+-----------------------+-----------------------+----+---------+

But when running against the S3 bucket, I get something like the following:

Filtered.show(false)
+-----+-----------------------+-----------------------+----+---------+
|KEY  |ROW_START_DATE         |ROW_END_DATE           |CODE|Indicator|
+-----+-----------------------+-----------------------+----+---------+
|One  |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A   |Yes      |
|Three|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|M   |Yes      |
+-----+-----------------------+-----------------------+----+---------+

Excluded.show(false)

+----+-----------------------+-----------------------+----+---------+
|KEY |ROW_START_DATE         |ROW_END_DATE           |CODE|Indicator|
+----+-----------------------+-----------------------+----+---------+
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A   |Yes      |---> wrong
|Four|2017-11-01T23:59:59.000|2017-12-09T23:59:58.000|A   |No       |
|Two |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|X   |No       |
|Five|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|    |No       |
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|    |No       |
+----+-----------------------+-----------------------+----+---------+

Is there any other way to subtract two Spark DataFrames?

2 answers:

Answer 0 (score: 1)

S3 is not a filesystem, and that difference can surface in Spark.

  1. Verify that the data written to S3 is identical to what you get when writing to a file:// destination, since there is a risk of data being lost along the way.
  2. Then try putting a Thread.sleep(10000) between the write to S3 and the read back; that will show whether directory-listing inconsistency is the cause (see the sketch after this list).
  3. If you are on EMR, try its consistent-view option.
  4. Try the s3a:// connector.
  5. If it does not work with s3a://, file a SPARK JIRA on issues.apache.org, put "s3a" in the text, and include this code snippet (which implicitly licenses it to the ASF). I can then copy it into a test, see whether I can reproduce it, and, if so, whether it goes away when I turn on S3Guard in Hadoop 3.1+.

Answer 1 (score: 1)

You can use a leftanti join on the two DataFrames, joining on whatever column(s) make a row unique; this gives you the output you expect from the except operation.

val diffdf = df1.join(df2, Seq("uniquekey"), "leftanti")
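Note that in the question's data, KEY alone is not unique (the key "One" appears twice), so a leftanti join on KEY would drop both "One" rows. Here is a sketch that emulates except() by joining on every column, reusing df and filterCond from the question; this adaptation is mine, not from the original answer:

// Emulate df.except(df.filter(filterCond)) with a leftanti join on all columns.
val filtered = df.filter(filterCond)
val excluded = df.join(filtered, df.columns.toSeq, "leftanti")
excluded.show(false)

One caveat: except() is a set operation and removes duplicate rows, while a leftanti join keeps them, so the results can differ when the left side contains exact duplicates.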