I have the following dataset:
+---------+---------+----------+-----------+-----------+-----------+
| Column1 | Column2 | Column3  | Exspense1 | Exspense2 | Exspense3 |
+---------+---------+----------+-----------+-----------+-----------+
| null    | null    | null     | 175935.40 | 2557400   | 0         |
| null    | null    | 20160511 | 94598.40  | 13050360  | 0         |
| null    | null    | 20160512 | 81337.00  | 12523645  | 0         |
| null    | Item1   | null     | 24955.20  | 4206475   | 0         |
| null    | Item1   | 20160511 | 14143.30  | 2357534   | 0         |
| null    | Item1   | 20160512 | 10811.90  | 1848941   | 0         |
| null    | Item2   | null     | 26725.20  | 2188031   | 0         |
| null    | Item2   | 20160511 | 17807.50  | 1400011   | 0         |
| null    | Item2   | 20160512 | 8917.70   | 788020    | 0         |
| null    | Item3   | null     | 19234.30  | 2787529   | 0         |
| null    | Item3   | 20160511 | 8204.30   | 1162487   | 0         |
| null    | Item3   | 20160512 | 11030.00  | 1625042   | 0         |
| null    | Item4   | null     | 85239.20  | 13848186  | 0         |
| null    | Item4   | 20160511 | 47324.10  | 7157838   | 0         |
| null    | Item4   | 20160512 | 37915.10  | 6690348   | 0         |
| null    | Item5   | null     | 19781.50  | 2543784   | 0         |
| null    | Item5   | 20160511 | 7119.20   | 972490    | 0         |
| null    | Item5   | 20160512 | 12662.30  | 1571294   | 0         |
| Shop1   | null    | null     | 35.70     | 10577     | 0         |
| Shop1   | null    | 20160512 | 35.70     | 10577     | 0         |
| Shop1   | Item1   | null     | 34.40     | 10538     | 0         |
| Shop1   | Item1   | 20160512 | 34.40     | 10538     | 0         |
| Shop1   | Item3   | null     | 1.30      | 39        | 0         |
| Shop1   | Item3   | 20160512 | 1.30      | 39        | 0         |
| Shop2   | null    | null     | 10757.30  | 2163921   | 0         |
| Shop2   | null    | 20160511 | 6672.20   | 1286947   | 0         |
| Shop2   | null    | 20160512 | 4085.10   | 876974    | 0         |
| Shop2   | Item1   | null     | 1510.30   | 370818    | 0         |
| Shop2   | Item1   | 20160511 | 752.10    | 190052    | 0         |
| Shop2   | Item1   | 20160512 | 758.20    | 180766    | 0         |
+---------+---------+----------+-----------+-----------+-----------+
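I am tracking a check flag, e.g. boolean sumCheck, for each of the columns below, and I have to iterate through each column:
1. For Column1, if sumCheck is true, I have to filter out the rows where Column1 is not null and the previous column in the same row is null. Since Column1 is the first column, there is no filter.
2. For Column2, the rows to filter out are those where (Column2 is not null and Column1 is null), and likewise for Column3 with Column2.
I need to arrive at the results below: the first table is the dataset after the Column2 step, the second after the Column3 step as well.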
<table><tbody><tr><th>Column1</th><th>Column2</th><th>Column3</th><th>Exspense1</th><th>Exspense2</th><th>Exspense3</th></tr><tr><td>null</td><td>null</td><td>null</td><td>175935.40</td><td>2557400</td><td>0</td></tr><tr><td>null</td><td>null</td><td>20160511</td><td>94598.40</td><td>13050360</td><td>0</td></tr><tr><td>null</td><td>null</td><td>20160512</td><td>81337.00</td><td>12523645</td><td>0</td></tr><tr><td>Shop1</td><td>null</td><td>null</td><td>35.70</td><td>10577</td><td>0</td></tr><tr><td>Shop1</td><td>null</td><td>20160512</td><td>35.70</td><td>10577</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>null</td><td>34.40</td><td>10538</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>20160512</td><td>34.40</td><td>10538</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>null</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>20160512</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>null</td><td>10757.30</td><td>2163921</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>20160511</td><td>6672.20</td><td>1286947</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>20160512</td><td>4085.10</td><td>876974</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>null</td><td>1510.30</td><td>370818</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160511</td><td>752.10</td><td>190052</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160512</td><td>758.20</td><td>180766</td><td>0</td></tr></tbody></table>
<table><tbody><tr><th>Column1</th><th>Column2</th><th>Column3</th><th>Exspense1</th><th>Exspense2</th><th>Exspense3</th></tr><tr><td>null</td><td>null</td><td>null</td><td>175935.40</td><td>2557400</td><td>0</td></tr><tr><td>Shop1</td><td>null</td><td>null</td><td>35.70</td><td>10577</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>null</td><td>34.40</td><td>10538</td><td>0</td></tr><tr><td>Shop1</td><td>Item1</td><td>20160512</td><td>34.40</td><td>10538</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>null</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop1</td><td>Item3</td><td>20160512</td><td>1.30</td><td>39</td><td>0</td></tr><tr><td>Shop2</td><td>null</td><td>null</td><td>10757.30</td><td>2163921</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>null</td><td>1510.30</td><td>370818</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160511</td><td>752.10</td><td>190052</td><td>0</td></tr><tr><td>Shop2</td><td>Item1</td><td>20160512</td><td>758.20</td><td>180766</td><td>0</td></tr></tbody></table>
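I am currently doing the following: I loop over the columns and look at the flag, starting from the second column.
For the second column:
val exceptDf = dataset.filter("Column2 is not null and Column1 is null")
For the third column:
val exceptDf3 = exceptDf.union(dataset.filter("Column3 is not null and Column2 is null"))
Finally I do:
dataset.except(exceptDf3)
Since I am using union, except and filter, I would like to know whether there is a way, or a single filter, that lets me avoid the union and except functions.
Please help me get the desired result.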
Answer 0 (score: 0)
You can use Spark's where or filter function.
For a sample dataset:
+----+---+----+---+
| c1| c2| c3| c4|
+----+---+----+---+
| 2.2|v21| 1|foo|
|null|v22| 2|bar|
| 4.4|v23| 3|baz|
| 5.5|v24|null|foo|
+----+---+----+---+
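Suppose the rows to drop are those where c2 != null and c1 == null, or where c4 != null and c3 == null. Negate each condition and combine the negations with and:
Using where:
df.where("(not(c2 is not null and c1 is null)) and (not(c4 is not null and c3 is null))")
Using filter (requires import org.apache.spark.sql.functions.col):
df.filter( !(col("c2").isNotNull && col("c1").isNull) && !(col("c4").isNotNull && col("c3").isNull) )
Output: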
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|2.2|v21| 1|foo|
|4.4|v23| 3|baz|
+---+---+---+---+
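The same idea can probably be extended to the question's dataset by building the predicate from the ordered column list, so that no union or except is needed. The following is only a sketch under assumptions: `HierarchyFilter`, `keepCondition` and `sample` are hypothetical names, the per-column sumCheck flag is assumed to mean "apply the rule to this column against the previous one", and Column3 is assumed to be a string.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object HierarchyFilter {
  // Hypothetical helper, not part of Spark: builds one "keep" predicate from
  // the ordered (columnName, sumCheck) pairs. For each adjacent pair whose
  // current column is flagged, it rejects rows where the current column is
  // set while the previous column is null.
  def keepCondition(flagged: Seq[(String, Boolean)]): Column =
    flagged
      .sliding(2)
      .collect { case Seq((prev, _), (curr, true)) =>
        !(col(curr).isNotNull && col(prev).isNull)
      }
      .foldLeft(lit(true))(_ && _)

  // A few rows taken from the question's dataset (Exspense1 only).
  // Assumption: Column3 is stored as a string.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Option[String], Option[String], Option[String], Double)](
      (None, None, None, 175935.40),
      (None, None, Some("20160511"), 94598.40),
      (None, Some("Item1"), None, 24955.20),
      (None, Some("Item1"), Some("20160511"), 14143.30),
      (Some("Shop1"), None, None, 35.70),
      (Some("Shop1"), None, Some("20160512"), 35.70),
      (Some("Shop1"), Some("Item1"), None, 34.40),
      (Some("Shop1"), Some("Item1"), Some("20160512"), 34.40)
    ).toDF("Column1", "Column2", "Column3", "Exspense1")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hierarchy-filter")
      .getOrCreate()

    // Column1 has no previous column, so its flag has no effect here.
    val flags = Seq("Column1" -> true, "Column2" -> true, "Column3" -> true)

    // Single pass over the data: no union, no except.
    sample(spark).filter(keepCondition(flags)).show()

    spark.stop()
  }
}
```

One difference to keep in mind: except also removes duplicate rows from the result, while filter keeps them, so the two approaches only agree when the dataset has no duplicates.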