Conditions in Spark

Time: 2018-03-08 06:57:48

Tags: apache-spark apache-spark-sql apache-spark-dataset

I have the following dataset:

+---------+---------+----------+-----------+-----------+-----------+
| Column1 | Column2 | Column3  | Exspense1 | Exspense2 | Exspense3 |
+---------+---------+----------+-----------+-----------+-----------+
| null    | null    | null     | 175935.40 |   2557400 |         0 |
| null    | null    | 20160511 | 94598.40  |  13050360 |         0 |
| null    | null    | 20160512 | 81337.00  |  12523645 |         0 |
| null    | Item1   | null     | 24955.20  |   4206475 |         0 |
| null    | Item1   | 20160511 | 14143.30  |   2357534 |         0 |
| null    | Item1   | 20160512 | 10811.90  |   1848941 |         0 |
| null    | Item2   | null     | 26725.20  |   2188031 |         0 |
| null    | Item2   | 20160511 | 17807.50  |   1400011 |         0 |
| null    | Item2   | 20160512 | 8917.70   |    788020 |         0 |
| null    | Item3   | null     | 19234.30  |   2787529 |         0 |
| null    | Item3   | 20160511 | 8204.30   |   1162487 |         0 |
| null    | Item3   | 20160512 | 11030.00  |   1625042 |         0 |
| null    | Item4   | null     | 85239.20  |  13848186 |         0 |
| null    | Item4   | 20160511 | 47324.10  |   7157838 |         0 |
| null    | Item4   | 20160512 | 37915.10  |   6690348 |         0 |
| null    | Item5   | null     | 19781.50  |   2543784 |         0 |
| null    | Item5   | 20160511 | 7119.209  |     72490 |         0 |
| null    | Item5   | 20160512 | 12662.30  |   1571294 |         0 |
| Shop1   | null    | null     | 35.70     |     10577 |         0 |
| Shop1   | null    | 20160512 | 35.701    |      0577 |         0 |
| Shop1   | Item1   | null     | 34.40     |     10538 |         0 |
| Shop1   | Item1   | 20160512 | 34.401    |      0538 |         0 |
| Shop1   | Item3   | null     | 1.30      |        39 |         0 |
| Shop1   | Item3   | 20160512 | 1.30      |        39 |         0 |
| Shop2   | null    | null     | 10757.30  |   2163921 |         0 |
| Shop2   | null    | 20160511 | 6672.20   |   1286947 |         0 |
| Shop2   | null    | 20160512 | 4085.10   |    876974 |         0 |
| Shop2   | Item1   | null     | 1510.30   |    370818 |         0 |
| Shop2   | Item1   | 20160511 | 752.101   |     90052 |         0 |
| Shop2   | Item1   | 20160512 | 758.201   |     80766 |         0 |
+---------+---------+----------+-----------+-----------+-----------+

I keep a check for each column, e.g. a boolean sumCheck, and I have to iterate over the columns. Now:

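1. For Column1: if sumCheck is true, I have to filter out rows where Column1 is not null and the previous column in the same row is null. Since Column1 is the first column, there is nothing to filter.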

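2. For Column2: if the check is true, I have to filter out rows where Column2 is not null and Column1 is null. In other words, I do not want rows matching (Column2 is not null and Column1 is null). I should end up with the following: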
+---------+---------+----------+-----------+-----------+-----------+
| Column1 | Column2 | Column3  | Exspense1 | Exspense2 | Exspense3 |
+---------+---------+----------+-----------+-----------+-----------+
| null    | null    | null     | 175935.40 |   2557400 |         0 |
| null    | null    | 20160511 | 94598.40  |  13050360 |         0 |
| null    | null    | 20160512 | 81337.00  |  12523645 |         0 |
| Shop1   | null    | null     | 35.70     |     10577 |         0 |
| Shop1   | null    | 20160512 | 35.701    |      0577 |         0 |
| Shop1   | Item1   | null     | 34.40     |     10538 |         0 |
| Shop1   | Item1   | 20160512 | 34.401    |      0538 |         0 |
| Shop1   | Item3   | null     | 1.30      |        39 |         0 |
| Shop1   | Item3   | 20160512 | 1.30      |        39 |         0 |
| Shop2   | null    | null     | 10757.30  |   2163921 |         0 |
| Shop2   | null    | 20160511 | 6672.20   |   1286947 |         0 |
| Shop2   | null    | 20160512 | 4085.10   |    876974 |         0 |
| Shop2   | Item1   | null     | 1510.30   |    370818 |         0 |
| Shop2   | Item1   | 20160511 | 752.101   |     90052 |         0 |
| Shop2   | Item1   | 20160512 | 758.201   |     80766 |         0 |
+---------+---------+----------+-----------+-----------+-----------+

3. For Column3: if the check is true, I have to remove the rows where Column3 is not null and Column2 is null, which gives me:
+---------+---------+----------+-----------+-----------+-----------+
| Column1 | Column2 | Column3  | Exspense1 | Exspense2 | Exspense3 |
+---------+---------+----------+-----------+-----------+-----------+
| null    | null    | null     | 175935.40 |   2557400 |         0 |
| Shop1   | null    | null     | 35.70     |     10577 |         0 |
| Shop1   | Item1   | null     | 34.40     |     10538 |         0 |
| Shop1   | Item1   | 20160512 | 34.401    |      0538 |         0 |
| Shop1   | Item3   | null     | 1.30      |        39 |         0 |
| Shop1   | Item3   | 20160512 | 1.30      |        39 |         0 |
| Shop2   | null    | null     | 10757.30  |   2163921 |         0 |
| Shop2   | Item1   | null     | 1510.30   |    370818 |         0 |
| Shop2   | Item1   | 20160511 | 752.101   |     90052 |         0 |
| Shop2   | Item1   | 20160512 | 758.201   |     80766 |         0 |
+---------+---------+----------+-----------+-----------+-----------+

I am currently doing the following:

I loop over the columns and look at each flag, starting from the second column. For the second column:

      val exceptDf = dataset.filter("Column2 is not null and Column1 is null");
      

For the third column:

      val exceptDf3 = exceptDf.union(dataset.filter("Column3 is not null and Column2 is null"));
      

Finally I do:

      dataset.except(exceptDf3);
      

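Since I am using union, except, and filter, I would like to know whether there is a way to do this with filter alone, so that I can avoid the union and except functions.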

Please help me get the desired result.

1 Answer:

Answer 0 (score: 0)

You can use Spark's where or filter function.

For a sample dataset:

+----+---+----+---+
|  c1| c2|  c3| c4|
+----+---+----+---+
| 2.2|v21|   1|foo|
|null|v22|   2|bar|
| 4.4|v23|   3|baz|
| 5.5|v24|null|foo|
+----+---+----+---+
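To try this locally, the sample could be built roughly as follows. This is only a sketch: the column types (Double for c1, Int for c3) are guessed from the printed values, and `SampleData` and `sampleDf` are hypothetical names used for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleData {
  // Nullable columns are modeled as Option; the column types are assumptions.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Option[Double], String, Option[Int], String)](
      (Some(2.2), "v21", Some(1), "foo"),
      (None, "v22", Some(2), "bar"),
      (Some(4.4), "v23", Some(3), "baz"),
      (Some(5.5), "v24", None, "foo")
    ).toDF("c1", "c2", "c3", "c4")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sample-data")
      .getOrCreate()
    val df = sampleDf(spark)
    df.show()
    spark.stop()
  }
}
```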

I have to exclude the rows that match either condition: (c2 is not null and c1 is null) or (c4 is not null and c3 is null):

Using where:

df.where("(not(c2 is not null and c1 is null)) and (not(c4 is not null and c3 is null))")

Using filter:

import org.apache.spark.sql.functions.col
df.filter(!(col("c2").isNotNull && col("c1").isNull) && !(col("c4").isNotNull && col("c3").isNull))

Output:

+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|2.2|v21|  1|foo|
|4.4|v23|  3|baz|
+---+---+---+---+
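Applied to the dataset in the question, the same idea can be generalized: for every column after the first whose flag is true, drop the rows where that column is not null while the column before it is null, and combine all of these into a single filter. Below is a minimal sketch, assuming the flags are available as a `Seq[Boolean]` aligned with the column names. `HierarchyFilter`, `keepCondition`, and `sumChecks` are hypothetical names, and the demo uses only a handful of rows from the question with a single expense column.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object HierarchyFilter {
  // Builds one predicate that keeps a row unless, for some checked column,
  // that column is set while the column immediately before it is null.
  // The first column has no predecessor, so its flag is never used.
  def keepCondition(columns: Seq[String], sumChecks: Seq[Boolean]): Column =
    columns.zip(sumChecks)
      .sliding(2)
      .collect { case Seq((prev, _), (curr, true)) =>
        !(col(curr).isNotNull && col(prev).isNull)
      }
      .foldLeft(lit(true))(_ && _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hierarchy-filter")
      .getOrCreate()
    import spark.implicits._

    // A few rows taken from the question's dataset.
    val dataset = Seq[(Option[String], Option[String], Option[String], Double)](
      (None, None, None, 175935.40),
      (None, None, Some("20160511"), 94598.40),
      (None, Some("Item1"), None, 24955.20),
      (Some("Shop1"), None, None, 35.70),
      (Some("Shop1"), None, Some("20160512"), 35.701),
      (Some("Shop1"), Some("Item1"), Some("20160512"), 34.401)
    ).toDF("Column1", "Column2", "Column3", "Exspense1")

    // Single filter, no union or except.
    val result = dataset.filter(
      keepCondition(Seq("Column1", "Column2", "Column3"), Seq(true, true, true)))
    result.show()
    spark.stop()
  }
}
```

One behavioral difference is worth keeping in mind: `Dataset.except` has EXCEPT DISTINCT semantics, so the union/except version also removes duplicate rows, whereas a plain filter keeps them. If duplicates matter, a `distinct()` after the filter would bring the two closer.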