Spark Dataframe Union提供重复项

时间:2018-06-19 00:23:05

标签: scala apache-spark

我有一个基础数据集,其中一列有空值而不是空值。 所以我这样做:

val nonTrained_ds = base_ds.filter(col("col_name").isNull)
val trained_ds = base_ds.filter(col("col_name").isNotNull)

当我打印出来时,我会清楚地分开行。但是当我这样做时,

val combined_ds = nonTrained_ds.union(trained_ds)

我从nonTrained_ds获得了行的重复记录,奇怪的是,来自trained_ds的行不再是合并后的ds。

为什么会这样?

trained_ds的值为:

+----------+----------------+
|unique_no |      running_id|
+----------+----------------+
|0456700001|16              |
|0456700004|16              |
|0456700007|16              |
|0456700010|16              |
|0456700013|16              |
|0456700016|16              |
|0456700019|16              |
|0456700022|16              |
|0456700025|16              |
|0456700028|16              |
|0456700031|16              |
|0456700034|16              |
|0456700037|16              |
|0456700040|16              |
|0456700043|16              |
|0456700046|16              |
|0456700049|16              |
|0456700052|16              |
|0456700055|16              |
|0456700058|16              |
|0456700061|16              |
|0456700064|16              |
|0456700067|16              |
|0456700070|16              |
+----------+----------------+

nonTrained_ds的值为:

+----------+----------------+
|unique_no |      running_id|
+----------+----------------+
|0456700002|null            |
|0456700003|null            |
|0456700005|null            |
|0456700006|null            |
|0456700008|null            |
|0456700009|null            |
|0456700011|null            |
|0456700012|null            |
|0456700014|null            |
|0456700015|null            |
|0456700017|null            |
|0456700018|null            |
|0456700020|null            |
|0456700021|null            |
|0456700023|null            |
|0456700024|null            |
|0456700026|null            |
|0456700027|null            |
|0456700029|null            |
|0456700030|null            |
|0456700032|null            |
|0456700033|null            |
|0456700035|null            |
|0456700036|null            |
|0456700038|null            |
|0456700039|null            |
|0456700041|null            |
|0456700042|null            |
|0456700044|null            |
|0456700045|null            |
|0456700047|null            |
|0456700048|null            |
|0456700050|null            |
|0456700051|null            |
|0456700053|null            |
|0456700054|null            |
|0456700056|null            |
|0456700057|null            |
|0456700059|null            |
|0456700060|null            |
|0456700062|null            |
|0456700063|null            |
|0456700065|null            |
|0456700066|null            |
|0456700068|null            |
|0456700069|null            |
|0456700071|null            |
|0456700072|null            |
+----------+----------------+

组合ds的值为:

+----------+----------------+
|unique_no |      running_id|
+----------+----------------+
|0456700002|null            |
|0456700003|null            |
|0456700005|null            |
|0456700006|null            |
|0456700008|null            |
|0456700009|null            |
|0456700011|null            |
|0456700012|null            |
|0456700014|null            |
|0456700015|null            |
|0456700017|null            |
|0456700018|null            |
|0456700020|null            |
|0456700021|null            |
|0456700023|null            |
|0456700024|null            |
|0456700026|null            |
|0456700027|null            |
|0456700029|null            |
|0456700030|null            |
|0456700032|null            |
|0456700033|null            |
|0456700035|null            |
|0456700036|null            |
|0456700038|null            |
|0456700039|null            |
|0456700041|null            |
|0456700042|null            |
|0456700044|null            |
|0456700045|null            |
|0456700047|null            |
|0456700048|null            |
|0456700050|null            |
|0456700051|null            |
|0456700053|null            |
|0456700054|null            |
|0456700056|null            |
|0456700057|null            |
|0456700059|null            |
|0456700060|null            |
|0456700062|null            |
|0456700063|null            |
|0456700065|null            |
|0456700066|null            |
|0456700068|null            |
|0456700069|null            |
|0456700071|null            |
|0456700072|null            |
|0456700002|16              |
|0456700005|16              |
|0456700008|16              |
|0456700011|16              |
|0456700014|16              |
|0456700017|16              |
|0456700020|16              |
|0456700023|16              |
|0456700026|16              |
|0456700029|16              |
|0456700032|16              |
|0456700035|16              |
|0456700038|16              |
|0456700041|16              |
|0456700044|16              |
|0456700047|16              |
|0456700050|16              |
|0456700053|16              |
|0456700056|16              |
|0456700059|16              |
|0456700062|16              |
|0456700065|16              |
|0456700068|16              |
|0456700071|16              |
+----------+----------------+

1 个答案:

答案 0 :(得分:0)

这可以解决问题,

val nonTrained_ds = base_ds.filter(col("primary_offer_id").isNull).distinct()
    val trained_ds = base_ds.filter(col("primary_offer_id").isNotNull).distinct()