Implementing group_concat functionality in Spark 1.4 - performance

Posted: 2016-12-14 14:01:19

Tags: scala apache-spark apache-spark-sql

Input DataFrame:

             qual_id                  req_id    attrib_code         id  qual_value          qual_list_id                 qual_id              list_value    include_exclude
1638031.000000000...    72320.00000000000...      PRICELIST 184234.0000000000...          LIST  7488737.000000000...    1638031.000000000...                     All                   
1636832.000000000...    72319.00000000000...      PRICELIST 184297.0000000000...          LIST  7464085.000000000...    1636832.000000000...                     All                   
1638033.000000000...    72320.00000000000...      PRICELIST 184232.0000000000...          LIST  7488739.000000000...    1638033.000000000...                     All                   
1636834.000000000...    72319.00000000000...      PRICELIST 184295.0000000000...          LIST  7464087.000000000...    1636834.000000000...                     All                   
1639034.000000000...    72321.00000000000...      PRICELIST 184426.0000000000...          LIST  7515418.000000000...    1639034.000000000...    Global Price List...                  I
1639034.000000000...    72321.00000000000...      PRICELIST 184426.0000000000...          LIST  7515417.000000000...    1639034.000000000...       Global Price List                  I
1638035.000000000...    72320.00000000000...      PRICELIST 184230.0000000000...          LIST  7488741.000000000...    1638035.000000000...                     All                   
1636836.000000000...    72319.00000000000...      PRICELIST 184293.0000000000...          LIST  7464089.000000000...    1636836.000000000...                     All                   
1638037.000000000...    72320.00000000000...      PRICELIST 184228.0000000000...          LIST  7488743.000000000...    1638037.000000000...                     All                   
1636838.000000000...    72319.00000000000...      PRICELIST 184291.0000000000...          LIST  7464091.000000000...    1636838.000000000...                     All                   
1639038.000000000...    72321.00000000000...      PRICELIST 184427.0000000000...          LIST  7515419.000000000...    1639038.000000000...       Global Price List                  I
1639038.000000000...    72321.00000000000...      PRICELIST 184427.0000000000...          LIST  7515420.000000000...    1639038.000000000...    Global Price List...                  I
1638039.000000000...    72320.00000000000...      PRICELIST 184226.0000000000...          LIST  7488745.000000000...    1638039.000000000...                     All                   
1636840.000000000...    72319.00000000000...      PRICELIST 184289.0000000000...          LIST  7464093.000000000...    1636840.000000000...                     All                   
1638041.000000000...    72320.00000000000...      PRICELIST 184224.0000000000...          LIST  7488747.000000000...    1638041.000000000...                     All                   
1636842.000000000...    72319.00000000000...      PRICELIST 184287.0000000000...          LIST  7464095.000000000...    1636842.000000000...                     All                   
1639042.000000000...    72321.00000000000...      PRICELIST 184428.0000000000...          LIST  7515421.000000000...    1639042.000000000...       Global Price List                  I
1639042.000000000...    72321.00000000000...      PRICELIST 184428.0000000000...          LIST  7515422.000000000...    1639042.000000000...    Global Price List...                  I
1638043.000000000...    72320.00000000000...      PRICELIST 184222.0000000000...          LIST  7488749.000000000...    1638043.000000000...                     All                   
1638843.000000000...    72320.00000000000...      PRICELIST 184384.0000000000...          LIST  7515196.000000000...    1638843.000000000...    Australia Price L...                  E

Code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val aggregatedRdd: RDD[Row] = test.rdd
  .groupBy { r =>
    (r.getAs[BigDecimal]("id").longValue(), r.getAs[BigDecimal]("req_id").longValue())
  }
  .map { case ((id, reqId), rows) =>
    // Rows whose include_exclude flag is not "E" contribute to the include list,
    // preferring list_value and falling back to qual_value; nulls are dropped.
    val includes = rows.flatMap { r =>
      val flag = r.getAs[String]("include_exclude")
      if (flag == null || !flag.equalsIgnoreCase("e"))
        Option(r.getAs[String]("list_value")).orElse(Option(r.getAs[String]("qual_value")))
      else
        None
    }
    // Rows flagged "E" contribute their list_value to the exclude list.
    val excludes = rows.flatMap { r =>
      val flag = r.getAs[String]("include_exclude")
      if (flag != null && flag.equalsIgnoreCase("e"))
        Option(r.getAs[String]("list_value"))
      else
        None
    }
    // Map each group to a new Row: (id, req_id, include list, exclude list)
    Row(id, reqId, includes, excludes)
  }
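The per-row branching above boils down to partitioning each group's rows into an include list and an exclude list. A standalone sketch of that logic over plain Scala values (the `Rec` case class and field names here are hypothetical, introduced only to mirror the three columns the code reads; `Option` stands in for SQL NULL):

```scala
// Hypothetical record: (list_value, qual_value, include_exclude) as Options.
case class Rec(listValue: Option[String], qualValue: Option[String], includeExclude: Option[String])

def splitLists(rows: Seq[Rec]): (Seq[String], Seq[String]) = {
  // A row is an "exclude" row when its flag is "E" (case-insensitive).
  val isExclude = (r: Rec) => r.includeExclude.exists(_.equalsIgnoreCase("e"))
  // Include rows prefer list_value, falling back to qual_value; nulls drop out.
  val includes = rows.filterNot(isExclude).flatMap(r => r.listValue.orElse(r.qualValue))
  // Exclude rows contribute only their list_value.
  val excludes = rows.filter(isExclude).flatMap(r => r.listValue)
  (includes, excludes)
}
```

This makes the fallback rule (list_value first, then qual_value, exclude rows kept separate) explicit and easy to unit-test outside Spark.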

Output:

+------------+------+--------------------+--------------------+
|id          |req_id|   include_pricelist|   exclude_pricelist|
+------------+------+--------------------+--------------------+
|      184273| 72319|               [All]|                  []|
|      184304| 72317|               [All]|[Australia Price ...|
|      184275| 72319|               [All]|                  []|
|      184382| 72320|               [All]|[Australia Price ...|
|      184152| 72320|               [All]|                  []|
|      184297| 72319|               [All]|                  []|
|      184207| 72320|               [All]|                  []|
|      184387| 72321|[Global Price Lis...|                  []|
|      184320| 72319|               [All]|                  []|
|      184306| 72319|               [All]|                  []|
|      184191| 72320|               [All]|                  []|
|      184138| 72320|               [All]|                  []|
|      184341| 72319|               [All]|                  []|
|      184189| 72320|               [All]|                  []|
|      184234| 72319|               [All]|                  []|
|      184261| 72319|               [All]|                  []|
|      184396| 72321|[Global Price Lis...|                  []|
|      184281| 72320|               [All]|                  []|
|      184165| 72320|               [All]|                  []|
|      184261| 72320|               [All]|                  []|
+------------+------+--------------------+--------------------+

I am using the code above to implement group_concat functionality in Spark 1.4, but it takes a long time to execute. How can I optimize this code to produce the same result with better performance? Also, instead of an empty array, can I get null in exclude_pricelist?
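One common approach (a sketch, not tested against this data) is to replace `groupBy`, which shuffles every full `Row` across the network and materializes each group in memory, with `aggregateByKey` over pre-projected values. Only the key and the one string needed per row are shuffled, and combiners run map-side. The same trick also covers the second question: emit `null` when the exclude list is empty.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Project each row down to its key and a single tagged value:
// Some((value, isExclude)) or None when the row contributes nothing.
val keyed = test.rdd.map { r =>
  val key = (r.getAs[BigDecimal]("id").longValue(), r.getAs[BigDecimal]("req_id").longValue())
  val flag = Option(r.getAs[String]("include_exclude")).exists(_.equalsIgnoreCase("e"))
  val value =
    if (flag) Option(r.getAs[String]("list_value")).map(v => (v, true))
    else Option(r.getAs[String]("list_value"))
           .orElse(Option(r.getAs[String]("qual_value")))
           .map(v => (v, false))
  (key, value)
}

// Accumulate (includes, excludes) per key; combiners merge map-side,
// so far less data crosses the network than with groupBy.
val aggregatedRdd: RDD[Row] = keyed
  .aggregateByKey((List.empty[String], List.empty[String]))(
    (acc, v) => v match {
      case Some((s, true))  => (acc._1, s :: acc._2)   // exclude value
      case Some((s, false)) => (s :: acc._1, acc._2)   // include value
      case None             => acc                     // row contributed nothing
    },
    (a, b) => (a._1 ::: b._1, a._2 ::: b._2)
  )
  .map { case ((id, reqId), (inc, exc)) =>
    // Emit null instead of an empty exclude list, as asked above.
    Row(id, reqId, inc, if (exc.isEmpty) null else exc)
  }
```

Note that `aggregateByKey` does not preserve the order of values within a group, so the lists may come out in a different order than with `groupBy`; if order matters, sort them in the final `map`.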

0 Answers:

No answers yet.