Spark Java DataFrame: sum a column and remove duplicates

Date: 2018-06-06 06:33:56

Tags: java apache-spark apache-spark-sql

I have a Spark DataFrame like the one below:

INPUT

+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt|  txnDttm|txnVol|udfChar1|  udfChar2|  udfChar3|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
|2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
|2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+

Now I want to:

  1. Sum the "txnAmt" column for records that have the same accountId and accountNumber.
  2. Remove the duplicate records.
  3. Get the output below:

    +----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
    | accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt|  txnDttm|txnVol|udfChar1|  udfChar2|  udfChar3|
    +----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
    |2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|   200|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
    |1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   200|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
    +----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
    

I am not sure how to perform step 1.

I have already written code for step 2, which drops duplicates based on accountId and accountNumber:

    // keep only one record per (accountId, accountNumber) pair
    String[] colNames = {"accountId", "accountNumber"};
    Dataset<RuleOutputParams> finalDs = rulesParamDS.dropDuplicates(colNames);
    

Can anyone help?

1 Answer:

Answer 0 (score: 1)

Load the data and register a SQL temp view for it:

val df = spark.read.format("csv").option("header", true).load("data.csv")
df.createOrReplaceTempView("t")

Then, what you need are window aggregation functions, plus the row_number() trick to remove the duplicates:

val df2 = spark.sql("""SELECT * FROM (
  SELECT *, 
    sum(txnAmt) OVER (PARTITION BY accountId, accountNumber) s, 
    row_number() OVER (PARTITION BY accountId, accountNumber ORDER BY processingDate) r FROM t) 
  WHERE r=1""")
  .drop("txnAmt", "r")
  .withColumnRenamed("s", "txnAmt")
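
Since the question is tagged java, here is a minimal sketch of the same window-plus-row_number approach using the DataFrame API in Java. It assumes the input is already loaded as a Dataset<Row> named df; the variable names (byAccount, result) are only for illustration:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// window over the two grouping columns (assumes df holds the input data)
WindowSpec byAccount = Window.partitionBy("accountId", "accountNumber");

Dataset<Row> result = df
    // total txnAmt for each (accountId, accountNumber) group
    .withColumn("sumTxnAmt", sum(col("txnAmt")).over(byAccount))
    // number the rows inside each group so only one row per group is kept
    .withColumn("rn", row_number().over(byAccount.orderBy(col("processingDate"))))
    .filter(col("rn").equalTo(1))
    .drop("txnAmt", "rn")
    .withColumnRenamed("sumTxnAmt", "txnAmt");

result.show();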

If you show it (e.g. df2.show), you will see:

+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|  txnDttm|txnVol|udfChar1|  udfChar2|  udfChar3|txnAmt|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
|2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2| 200.0|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2| 200.0|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+

As a side note, you might be tempted to simply add more columns to the query below, but then you would also need to add them to the GROUP BY clause:

spark.sql("SELECT accountId, accountNumber, SUM(txnAmt) txnAmt FROM t GROUP BY accountId, accountNumber").show
+----------+-------------+------+
| accountId|accountNumber|txnAmt|
+----------+-------------+------+
|2032000000|   2032000000| 200.0|
|1322000000|   1322000000| 200.0|
+----------+-------------+------+
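
For completeness, the same GROUP BY aggregation could be written with the Java DataFrame API along these lines (again assuming the data is loaded as a Dataset<Row> named df; only the grouping columns and the summed amount survive in this version):

import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// group on the two key columns and sum txnAmt; all other columns are dropped
Dataset<Row> summed = df
    .groupBy("accountId", "accountNumber")
    .agg(sum("txnAmt").alias("txnAmt"));

summed.show();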