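I'm trying to resolve a performance problem in one of my Spark jobs, and I think the issue is in how I use the "cogroup" function. I'm trying to combine two DataFrames (both are large, so neither can be broadcast), and a simple join won't work because I need to add a lot of extra processing logic.
Here are samples of the two DataFrames:
Transactions: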
+---------+-----------------+--------+
| CardNum | TransactionTime | Amount |
+---------+-----------------+--------+
| ABC | 20190101 | 10.0 |
| ABC | 20180501 | 25.0 |
| DEF | 20181201 | 30.0 |
| ghi | 20180101 | 20.0 |
+---------+-----------------+--------+
Lookup:
+---------+------------+-----------------+------------------+-------------+
| CardID | InternalId | RecordStartDate | RecordExpiryDate | AnotherCode |
+---------+------------+-----------------+------------------+-------------+
| abc | 10001 | 2018-01-01 | 2018-05-20 | A |
| def | 10002 | 2018-01-01 | 9999-12-31 | A |
| def | 10005 | 2018-01-01 | 9999-12-31 | B |
| ghi | 10003 | 2018-01-01 | 9999-12-31 | B |
| abc | 20001 | 2018-05-20 | 9999-12-31 | A |
+---------+------------+-----------------+------------------+-------------+
Expected result:
+---------+-----------------+--------+------------+--------------------------------------------------------------+
| CardNum | TransactionTime | Amount | InternalID | Additional Explanation |
+---------+-----------------+--------+------------+--------------------------------------------------------------+
| ABC | 20190101 | 10.0 | 20001 | For this txn time, this internal id matches |
| ABC | 20180501 | 25.0 | 10001 | For an older txn and same card as above, the older id matches |
| DEF | 20181201 | 30.0 | 10002 | If two results are valid, pick the internal id with code "A" |
| ghi | 20180101 | 20.0 | 10003 | Since only one match, keep the returned id |
+---------+-----------------+--------+------------+--------------------------------------------------------------+
How I currently join the data:
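import org.apache.spark.sql.catalyst.encoders.RowEncoder
import spark.implicits._ // Encoder[String] for the grouping key; assumes the SparkSession is named spark

// Conversion to lowercase is needed because the grouping needs to ignore case
val transactionsGroupedDF = transactionsDF.groupByKey(item => item.getAs[String]("CardNum").toLowerCase)
val lookupGroupedDF = lookupDF.groupByKey(item => item.getAs[String]("CardID").toLowerCase)
val resultDF = transactionsGroupedDF.cogroup(lookupGroupedDF) {
  case (key, iter1, iter2) =>
    // Materialize both iterators: the lookup rows are reused for every transaction of this card
    val txnDataList = iter1.toList
    val lookupList = iter2.toList
    txnDataList.map(item => resolveInternalId(item, lookupList, key))
}(RowEncoder(transactionsDF.schema.add("internalID", "string")))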
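I think I need a cogroup here because I really need both sides of the data together: enriching each transaction with the correct internal ID depends on the transaction date, on the business rules around "AnotherCode", and on being able to write an exception value when the internal ID cannot be resolved.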
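To make those rules concrete, here is a minimal, Spark-free sketch of what the resolution step might look like for one card's already-grouped lookup rows. Everything in it is an assumption for illustration, not my real code: the `Txn` and `LookupRecord` case classes, the two-argument `resolveInternalId` signature, the `Unresolved` placeholder value, and treating the record start date as inclusive and the expiry date as exclusive.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical stand-ins for the rows of the two tables shown above.
final case class Txn(cardNum: String, transactionTime: String, amount: Double)
final case class LookupRecord(
    cardId: String,
    internalId: String,
    recordStartDate: LocalDate,
    recordExpiryDate: LocalDate,
    anotherCode: String)

object InternalIdResolver {
  // Assumed exception value written when no lookup record is valid.
  val Unresolved = "UNRESOLVED"

  // TransactionTime in the sample data is formatted as yyyyMMdd.
  private val txnFormat = DateTimeFormatter.BASIC_ISO_DATE

  def resolveInternalId(txn: Txn, lookups: Seq[LookupRecord]): String = {
    val txnDate = LocalDate.parse(txn.transactionTime, txnFormat)
    // Keep records whose validity window contains the transaction date
    // (assumption: start inclusive, expiry exclusive).
    val valid = lookups.filter { r =>
      !txnDate.isBefore(r.recordStartDate) && txnDate.isBefore(r.recordExpiryDate)
    }
    // Prefer a record with code "A" when several are valid, otherwise take
    // the first valid one, and fall back to the exception value if none is.
    valid.find(_.anotherCode == "A")
      .orElse(valid.headOption)
      .map(_.internalId)
      .getOrElse(Unresolved)
  }
}
```

Inside the cogroup closure, a function like this would be called once per transaction with the lookup rows that share the same lowercased card key.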
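I believe the code works as expected, but I'm worried that I'm not doing this transformation in the best way. The multiple groupByKey calls concern me, and so does the cogroup call, since I'm not 100% familiar with it.
Any feedback would be greatly appreciated. Thanks!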