Running my own function on each GroupBy group in Spark / Scala

Posted: 2019-08-29 20:02:04

Tags: scala apache-spark hadoop apache-spark-sql hdfs

I need to process rows in Spark / Scala, please help.

I have the following dataframe DF1:

+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
| ACC_SECURITY|ACCOUNT_NO|COSTCENTER|    BU|   MPU|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|POS_NEG_QUANTITY|PROCESSED|ALLOC_QUANTITY|NET_QUANTITY|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2|   18063|               P|         |             0|           0|
|3FA34782290X2|  3FA34782|    0800TS|BOXXBU|BOXXMP|    0102|     5322|      290X2|    -863|               N|         |             0|           0|
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2| -108926|               N|         |             0|           0|
|9211530135G71|  92115301|    08036C|BOXXBU|BOXXMP|    0154|     8380|      35G71|    8003|               P|         |             0|           0|
|9211530235G71|  92115302|    08036C|BOXXBU|BOXXMP|    0144|     8382|      35G71|   -2883|               N|         |             0|           0|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+

And the dataframe DF2:

+------+------+--------+---------+----------+
|ENTITY|MATRIX|PRIORITY|LONG_CODE|SHORT_CODE|
+------+------+--------+---------+----------+
|   300|    00|   16600|     0101|      5322|
|   300|    00|   19900|     0101|      5279|
|   300|    00|  298300|     0102|      5279|
|   300|    00|   17800|     0154|      8382|
|   300|    00|  505900|     0233|      5279|
+------+------+--------+---------+----------+

I want to group the above dataframe by SECURITY_ID, which gives the following 2 groups:

+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
| ACC_SECURITY|ACCOUNT_NO|COSTCENTER|    BU|   MPU|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|POS_NEG_QUANTITY|PROCESSED|ALLOC_QUANTITY|NET_QUANTITY|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2|   18063|               P|         |             0|           0|
|3FA34782290X2|  3FA34782|    0800TS|BOXXBU|BOXXMP|    0102|     5322|      290X2|    -863|               N|         |             0|           0|
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2| -108926|               N|         |             0|           0|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+

+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
| ACC_SECURITY|ACCOUNT_NO|COSTCENTER|    BU|   MPU|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|POS_NEG_QUANTITY|PROCESSED|ALLOC_QUANTITY|NET_QUANTITY|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
|9211530135G71|  92115301|    08036C|BOXXBU|BOXXMP|    0154|     8380|      35G71|    8003|               P|         |             0|           0|
|9211530235G71|  92115302|    08036C|BOXXBU|BOXXMP|    0144|     8382|      35G71|   -2883|               N|         |             0|           0|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+

Then, within each group, I want to:

- take LONG_IND from rows where QUANTITY is positive, and SHORT_IND from rows where QUANTITY is negative,
- look up the priority for those code combinations in the priority dataframe DF2, and
- sort by PRIORITY in ascending order.

Doing this for just the first group, I get the following priority data:

+---------+----------+--------+
|LONG_CODE|SHORT_CODE|PRIORITY|
+---------+----------+--------+
|     0101|      5322|   16600|
|     0101|      5279|   19900|
|     0102|      5279|  298300|
+---------+----------+--------+
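As a sanity check, the priority table above can be reproduced with plain Scala over in-memory copies of the data; in Spark this would be a join against DF2 followed by `orderBy` on PRIORITY. Note one assumption in this sketch: it takes every LONG_IND/SHORT_IND combination that occurs anywhere in the group and keeps the ones present in DF2, since that is what reproduces the three rows shown.

```scala
// Hedged sketch: reproduce the priority lookup for group 290X2 in memory.
// DF2 rows as (LONG_CODE, SHORT_CODE, PRIORITY) tuples.
val df2 = Seq(
  ("0101", "5322",  16600),
  ("0101", "5279",  19900),
  ("0102", "5279", 298300),
  ("0154", "8382",  17800),
  ("0233", "5279", 505900))

// Codes appearing in the 290X2 group. Assumption: all LONG_IND and
// SHORT_IND values of the group are candidates, not only the pos/neg
// split, because that is what matches the table above.
val longCodes  = Set("0101", "0102")
val shortCodes = Set("5279", "5322")

// Keep combinations present in DF2 and sort ascending by PRIORITY.
val priorities = df2
  .filter { case (l, s, _) => longCodes(l) && shortCodes(s) }
  .sortBy { case (_, _, p) => p }
```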

Then DF1 is processed in the above priority order by adding quantities. Here we go through 2 iterations: Row1.QUANTITY is added to Row2.QUANTITY, and the other columns are updated depending on whether the result is positive or negative:

- ALLOC_QUANTITY: how much quantity was neutralized (here 18063 + (-863), so 863 was neutralized)
- NET_QUANTITY: how much quantity is still left to process in the next iteration (here 18063 - 863 = 17200 remains)
- PROCESSED: set to "P" once NET_QUANTITY reaches zero. So at this point only the second row is fully processed, and only it gets PROCESSED = P.

Iteration 1: process rows 1 and 2. Quantity = 18063 + (-863) = 17200 (positive, so the other columns are kept from the first row):

+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
| ACC_SECURITY|ACCOUNT_NO|COSTCENTER|    BU|   MPU|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|POS_NEG_QUANTITY|PROCESSED|ALLOC_QUANTITY|NET_QUANTITY|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2|   18063|               P|        N|           863|       17200|
|3FA34782290X2|  3FA34782|    0800TS|BOXXBU|BOXXMP|    0102|     5322|      290X2|    -863|               N|        P|             0|           0|
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2| -108926|               N|         |             0|           0|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+

Iteration 2: process rows 1 and 3. Quantity = 17200 + (-108926) = -91726 (negative, so the other columns are kept from the third row):

+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
| ACC_SECURITY|ACCOUNT_NO|COSTCENTER|    BU|   MPU|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|POS_NEG_QUANTITY|PROCESSED|ALLOC_QUANTITY|NET_QUANTITY|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2|   18063|               P|        P|           863|       17200|
|3FA34782290X2|  3FA34782|    0800TS|BOXXBU|BOXXMP|    0102|     5322|      290X2|    -863|               N|        P|             0|           0|
|3FA34789290X2|  3FA34789|    0800TS|BOXXBU|BOXXMP|    0101|     5279|      290X2| -108926|               N|        P|             0|           0|
+-------------+----------+----------+------+------+--------+---------+-----------+--------+----------------+---------+--------------+------------+

Here Row1's PROCESSED becomes P and Row3's PROCESSED becomes P only because the group has now finished processing all 3 of its rows.
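The arithmetic of the two iterations above can be written as a small helper. `netOnce` is an illustrative name, not from the question: it adds one negative quantity to the running positive quantity and reports how much of the positive quantity was neutralized.

```scala
// Hedged sketch of one netting iteration (illustrative helper, not the
// questioner's code). `running` is the remaining positive quantity,
// `neg` is the (negative) quantity being netted against it.
def netOnce(running: Long, neg: Long): (Long, Long) = {
  val remaining   = running + neg                        // neg is <= 0
  val neutralized = math.max(0L, math.min(running, -neg)) // amount cancelled out
  (remaining, neutralized)
}

// Iteration 1: 18063 + (-863)
val (after1, alloc1) = netOnce(18063L, -863L)    // remaining 17200, neutralized 863

// Iteration 2: 17200 + (-108926)
val (after2, alloc2) = netOnce(after1, -108926L) // remaining -91726, neutralized 17200
```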

I tried the following but could not crack it. Please help with how to create a function that performs the above iterations on the rows of each group, perhaps using groupByKey and mapGroups.

case class AllocOneProcess(ACC_SECURITY: String, ACCOUNT_NO: String, COSTCENTER: String,
    BU: String, MPU: String, LONG_IND: String, SHORT_IND: String, SECURITY_ID: String,
    QUANTITY: String, POS_NEG_QUANTITY: String, PROCESSED: String,
    ALLOC_QUANTITY: Integer, NET_QUANTITY: Integer)

val toBeProcessedAllocOneDF2 = toBeProcessedAllocOneDF.as[AllocOneProcess]
val toBeProcessedAllocOneDF3 = toBeProcessedAllocOneDF.toDF()

prioritymatrixDF.show()

val x = toBeProcessedAllocOneDF2
  .groupByKey(_.SECURITY_ID)
  .mapGroups {
    case (nameKey, df) =>
      allocOneProcess(df) // stuck here: how should allocOneProcess iterate the group?
  }
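One way to make the attempt above concrete is to keep the per-group logic as a pure function, so the netting can be unit-tested without Spark, and then wire it in with `groupByKey` plus `flatMapGroups` (which lets a group return several rows). The sketch below is an assumption about how `allocOneProcess` might look, using a trimmed row type (identifier columns elided); it assumes one positive row per group and that the negative rows arrive already sorted by PRIORITY.

```scala
// Hedged sketch, not the questioner's code: trimmed row type and a pure
// per-group netting function. Assumes one positive row per group and that
// negative rows are already ordered by PRIORITY.
case class NetRow(ACC_SECURITY: String, QUANTITY: Long, POS_NEG_QUANTITY: String,
                  PROCESSED: String, ALLOC_QUANTITY: Long, NET_QUANTITY: Long)

def allocOneProcess(rows: Seq[NetRow]): Seq[NetRow] = {
  val (pos, neg) = rows.partition(_.POS_NEG_QUANTITY == "P")
  pos.headOption.fold(rows) { p =>
    var running = p.QUANTITY // remaining positive quantity
    var alloc   = 0L         // total neutralized so far
    val negOut = neg.map { n =>
      // how much of the remaining positive quantity this negative row eats
      val neutralized = math.max(0L, math.min(running, -n.QUANTITY))
      running += n.QUANTITY
      alloc   += neutralized
      n.copy(PROCESSED = "P", ALLOC_QUANTITY = 0L, NET_QUANTITY = 0L)
    }
    // The walkthrough is ambiguous about ALLOC/NET on the positive row after
    // the last iteration; this sketch simply accumulates the totals.
    p.copy(PROCESSED = "P", ALLOC_QUANTITY = alloc, NET_QUANTITY = running) +: negOut
  }
}

// Spark wiring (untested here, requires a SparkSession and encoders):
//   toBeProcessedAllocOneDF2
//     .groupByKey(_.SECURITY_ID)
//     .flatMapGroups { (securityId, rows) => allocOneProcess(rows.toSeq) }
```

Compared with `mapGroups`, `flatMapGroups` returns all the rewritten rows of the group rather than a single value per key, which matches the desired output shape.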

0 Answers