I have a Spark DataFrame, as shown below:
INPUT
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt| txnDttm|txnVol|udfChar1| udfChar2| udfChar3|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
|2032000000| 2032000000| C1_F_ANO| CA| USD| PRICEITEM2| UK| Country| 2018-06-06| 100|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|2032000000| 2032000000| C1_F_ANO| CA| USD| PRICEITEM2| UK| Country| 2018-06-06| 100|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000| 1322000000| C1_F_ANO| CA| USD| PRICEITEM1| US| Country| 2018-06-06| 100|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000| 1322000000| C1_F_ANO| CA| USD| PRICEITEM1| US| Country| 2018-06-06| 100|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
Now I want to:
1. sum txnAmt across the rows that share the same accountId and accountNumber, and
2. remove the duplicate rows.
OUTPUT
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt| txnDttm|txnVol|udfChar1| udfChar2| udfChar3|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
|2032000000| 2032000000| C1_F_ANO| CA| USD| PRICEITEM2| UK| Country| 2018-06-06| 200|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000| 1322000000| C1_F_ANO| CA| USD| PRICEITEM1| US| Country| 2018-06-06| 200|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
I am not sure how to perform step 1.
I have already written the code for step 2, which drops duplicates based on accountId and accountNumber:
String[] colNames = {"accountId", "accountNumber"};
Dataset<RuleOutputParams> finalDs = rulesParamDS.dropDuplicates(colNames);
Can someone help?
Answer 0 (score: 1)
Load the data and create a SQL table for it:
val df = spark.read.format("csv").option("header", true).load("data.csv")
df.createOrReplaceTempView("t")
Then what you need is called window aggregate functions, together with the row_number() trick to remove the duplicates:
val df2 = spark.sql("""SELECT * FROM (
SELECT *,
sum(txnAmt) OVER (PARTITION BY accountId, accountNumber) s,
row_number() OVER (PARTITION BY accountId, accountNumber ORDER BY processingDate) r FROM t)
WHERE r=1""")
.drop("txnAmt", "r")
.withColumnRenamed("s", "txnAmt")
If you show df2, you will see:
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate| txnDttm|txnVol|udfChar1| udfChar2| udfChar3|txnAmt|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
|2032000000| 2032000000| C1_F_ANO| CA| USD| PRICEITEM2| UK| Country| 2018-06-06|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2| 200.0|
|1322000000| 1322000000| C1_F_ANO| CA| USD| PRICEITEM1| US| Country| 2018-06-06|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2| 200.0|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
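For completeness, the same logic can also be expressed with the DataFrame API instead of a SQL string. A minimal sketch, assuming df is the DataFrame loaded above (the names byAccount and df2b are mine):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, sum}

// one window per duplicate key: the sum runs over the whole partition,
// while row_number() additionally needs an ordering to pick a single row
val byAccount = Window.partitionBy("accountId", "accountNumber")

val df2b = df
  .withColumn("s", sum("txnAmt").over(byAccount))
  .withColumn("r", row_number().over(byAccount.orderBy("processingDate")))
  .where(col("r") === 1)
  .drop("txnAmt", "r")
  .withColumnRenamed("s", "txnAmt")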
As a side note, you might be tempted to just use the GROUP BY below, but any extra columns you want to keep would have to be added to the GROUP BY clause:
spark.sql("SELECT accountId, accountNumber, SUM(txnAmt) txnAmt FROM t GROUP BY accountId, accountNumber").show
+----------+-------------+------+
| accountId|accountNumber|txnAmt|
+----------+-------------+------+
|2032000000| 2032000000| 200.0|
|1322000000| 1322000000| 200.0|
+----------+-------------+------+
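If every non-key column is constant within a group (as in the sample data), you can keep them without listing them all in GROUP BY by aggregating them with first(). A sketch under that assumption (aggDf is a name I made up):

import org.apache.spark.sql.functions.{first, sum}

// first() is safe here only because the non-key columns are identical within each group
val aggDf = df.groupBy("accountId", "accountNumber")
  .agg(
    sum("txnAmt").as("txnAmt"),
    first("processingDate").as("processingDate"),
    first("priceItemCd").as("priceItemCd")
    // ...add a first() for each remaining column you want to keep
  )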