我有一个基础数据集,其中一列有空值而不是空值。 所以我这样做:
val nonTrained_ds = base_ds.filter(col("col_name").isNull)
val trained_ds = base_ds.filter(col("col_name").isNotNull)
当我打印出来时,我会清楚地分开行。但是当我这样做时,
val combined_ds = nonTrained_ds.union(trained_ds)
我从nonTrained_ds
获得了行的重复记录,奇怪的是,来自trained_ds
的行不再是合并后的ds。
为什么会这样?
trained_ds
的值为:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700001|16 |
|0456700004|16 |
|0456700007|16 |
|0456700010|16 |
|0456700013|16 |
|0456700016|16 |
|0456700019|16 |
|0456700022|16 |
|0456700025|16 |
|0456700028|16 |
|0456700031|16 |
|0456700034|16 |
|0456700037|16 |
|0456700040|16 |
|0456700043|16 |
|0456700046|16 |
|0456700049|16 |
|0456700052|16 |
|0456700055|16 |
|0456700058|16 |
|0456700061|16 |
|0456700064|16 |
|0456700067|16 |
|0456700070|16 |
+----------+----------------+
nonTrained_ds
的值为:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
+----------+----------------+
组合ds的值为:
+----------+----------------+
|unique_no | running_id|
+----------+----------------+
|0456700002|null |
|0456700003|null |
|0456700005|null |
|0456700006|null |
|0456700008|null |
|0456700009|null |
|0456700011|null |
|0456700012|null |
|0456700014|null |
|0456700015|null |
|0456700017|null |
|0456700018|null |
|0456700020|null |
|0456700021|null |
|0456700023|null |
|0456700024|null |
|0456700026|null |
|0456700027|null |
|0456700029|null |
|0456700030|null |
|0456700032|null |
|0456700033|null |
|0456700035|null |
|0456700036|null |
|0456700038|null |
|0456700039|null |
|0456700041|null |
|0456700042|null |
|0456700044|null |
|0456700045|null |
|0456700047|null |
|0456700048|null |
|0456700050|null |
|0456700051|null |
|0456700053|null |
|0456700054|null |
|0456700056|null |
|0456700057|null |
|0456700059|null |
|0456700060|null |
|0456700062|null |
|0456700063|null |
|0456700065|null |
|0456700066|null |
|0456700068|null |
|0456700069|null |
|0456700071|null |
|0456700072|null |
|0456700002|16 |
|0456700005|16 |
|0456700008|16 |
|0456700011|16 |
|0456700014|16 |
|0456700017|16 |
|0456700020|16 |
|0456700023|16 |
|0456700026|16 |
|0456700029|16 |
|0456700032|16 |
|0456700035|16 |
|0456700038|16 |
|0456700041|16 |
|0456700044|16 |
|0456700047|16 |
|0456700050|16 |
|0456700053|16 |
|0456700056|16 |
|0456700059|16 |
|0456700062|16 |
|0456700065|16 |
|0456700068|16 |
|0456700071|16 |
+----------+----------------+
答案 0 :(得分:0)
这可以解决问题,
val nonTrained_ds = base_ds.filter(col("primary_offer_id").isNull).distinct()
val trained_ds = base_ds.filter(col("primary_offer_id").isNotNull).distinct()