Input DataFrame:
qual_id req_id attrib_code id qual_value qual_list_id qual_id list_value include_exclude
1638031.000000000... 72320.00000000000... PRICELIST 184234.0000000000... LIST 7488737.000000000... 1638031.000000000... All
1636832.000000000... 72319.00000000000... PRICELIST 184297.0000000000... LIST 7464085.000000000... 1636832.000000000... All
1638033.000000000... 72320.00000000000... PRICELIST 184232.0000000000... LIST 7488739.000000000... 1638033.000000000... All
1636834.000000000... 72319.00000000000... PRICELIST 184295.0000000000... LIST 7464087.000000000... 1636834.000000000... All
1639034.000000000... 72321.00000000000... PRICELIST 184426.0000000000... LIST 7515418.000000000... 1639034.000000000... Global Price List... I
1639034.000000000... 72321.00000000000... PRICELIST 184426.0000000000... LIST 7515417.000000000... 1639034.000000000... Global Price List I
1638035.000000000... 72320.00000000000... PRICELIST 184230.0000000000... LIST 7488741.000000000... 1638035.000000000... All
1636836.000000000... 72319.00000000000... PRICELIST 184293.0000000000... LIST 7464089.000000000... 1636836.000000000... All
1638037.000000000... 72320.00000000000... PRICELIST 184228.0000000000... LIST 7488743.000000000... 1638037.000000000... All
1636838.000000000... 72319.00000000000... PRICELIST 184291.0000000000... LIST 7464091.000000000... 1636838.000000000... All
1639038.000000000... 72321.00000000000... PRICELIST 184427.0000000000... LIST 7515419.000000000... 1639038.000000000... Global Price List I
1639038.000000000... 72321.00000000000... PRICELIST 184427.0000000000... LIST 7515420.000000000... 1639038.000000000... Global Price List... I
1638039.000000000... 72320.00000000000... PRICELIST 184226.0000000000... LIST 7488745.000000000... 1638039.000000000... All
1636840.000000000... 72319.00000000000... PRICELIST 184289.0000000000... LIST 7464093.000000000... 1636840.000000000... All
1638041.000000000... 72320.00000000000... PRICELIST 184224.0000000000... LIST 7488747.000000000... 1638041.000000000... All
1636842.000000000... 72319.00000000000... PRICELIST 184287.0000000000... LIST 7464095.000000000... 1636842.000000000... All
1639042.000000000... 72321.00000000000... PRICELIST 184428.0000000000... LIST 7515421.000000000... 1639042.000000000... Global Price List I
1639042.000000000... 72321.00000000000... PRICELIST 184428.0000000000... LIST 7515422.000000000... 1639042.000000000... Global Price List... I
1638043.000000000... 72320.00000000000... PRICELIST 184222.0000000000... LIST 7488749.000000000... 1638043.000000000... All
1638843.000000000... 72320.00000000000... PRICELIST 184384.0000000000... LIST 7515196.000000000... 1638843.000000000... Australia Price L... E
Code:
val aggregatedRdd: RDD[Row] = test.rdd.groupBy(r =>
  (r.getAs[BigDecimal]("id").longValue(), r.getAs[BigDecimal]("req_id").longValue())
).map(row =>
  // Map the grouped values to a new Row: (id, req_id, include list, exclude list)
  Row(row._1._1, row._1._2,
    // Include list: rows not flagged "E"; use list_value, falling back to qual_value
    row._2.map(x => {
      if (x.getAs[String]("include_exclude") == null || !x.getAs[String]("include_exclude").equalsIgnoreCase("e")) {
        if (x.getAs[String]("list_value") == null)
          x.getAs[String]("qual_value")
        else
          x.getAs[String]("list_value")
      } else {
        null
      }
    }).filter { x => x != null },
    // Exclude list: list_value of rows flagged "E"
    row._2.map(y => {
      if (y.getAs[String]("include_exclude") != null && y.getAs[String]("include_exclude").equalsIgnoreCase("e")) {
        y.getAs[String]("list_value")
      } else {
        null
      }
    }).filter { y => y != null })
)
Output:
+------------+------+--------------------+--------------------+
|id |req_id| include_pricelist| exclude_pricelist|
+------------+------+--------------------+--------------------+
| 184273| 72319| [All]| []|
| 184304| 72317| [All]|[Australia Price ...|
| 184275| 72319| [All]| []|
| 184382| 72320| [All]|[Australia Price ...|
| 184152| 72320| [All]| []|
| 184297| 72319| [All]| []|
| 184207| 72320| [All]| []|
| 184387| 72321|[Global Price Lis...| []|
| 184320| 72319| [All]| []|
| 184306| 72319| [All]| []|
| 184191| 72320| [All]| []|
| 184138| 72320| [All]| []|
| 184341| 72319| [All]| []|
| 184189| 72320| [All]| []|
| 184234| 72319| [All]| []|
| 184261| 72319| [All]| []|
| 184396| 72321|[Global Price Lis...| []|
| 184281| 72320| [All]| []|
| 184165| 72320| [All]| []|
| 184261| 72320| [All]| []|
+------------+------+--------------------+--------------------+
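I am using the code above to implement group_concat functionality in Spark 1.4, but it takes a long time to execute. How can I optimize it to get the same result with better performance?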
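One direction that might help, offered as a sketch under assumptions rather than a measured fix: groupBy shuffles every full Row and then walks each group twice, while a single-pass aggregation only needs the two string lists per key. The block below is plain Scala with no Spark dependency so it can run on its own. Qual, GroupConcatSketch, seqOp, combOp and aggregate are hypothetical names that do not come from the code above, and Qual only stands in for the Row fields the code reads.

```scala
// Hypothetical stand-in for the Row fields read in the question's code.
case class Qual(id: Long, reqId: Long, qualValue: String,
                listValue: String, includeExclude: String)

object GroupConcatSketch {
  // Accumulator per (id, req_id): (include values, exclude values).
  type Acc = (Vector[String], Vector[String])

  val zero: Acc = (Vector.empty[String], Vector.empty[String])

  // Fold one record into the accumulator, applying the same rules as the
  // original code: rows flagged "E" contribute list_value to the exclude
  // list; all other rows contribute list_value (or qual_value when
  // list_value is null) to the include list. Nulls are dropped.
  def seqOp(acc: Acc, q: Qual): Acc = {
    val isExclude =
      q.includeExclude != null && q.includeExclude.equalsIgnoreCase("e")
    if (isExclude) {
      if (q.listValue != null) (acc._1, acc._2 :+ q.listValue) else acc
    } else {
      val v = if (q.listValue == null) q.qualValue else q.listValue
      if (v != null) (acc._1 :+ v, acc._2) else acc
    }
  }

  // Merge two partial accumulators for the same key.
  def combOp(a: Acc, b: Acc): Acc = (a._1 ++ b._1, a._2 ++ b._2)

  // Local equivalent of a keyed aggregation: one pass over the rows.
  def aggregate(rows: Seq[Qual]): Map[(Long, Long), Acc] =
    rows.foldLeft(Map.empty[(Long, Long), Acc]) { (m, q) =>
      val key = (q.id, q.reqId)
      m.updated(key, seqOp(m.getOrElse(key, zero), q))
    }
}
```

On the RDD side, the same zero, seqOp and combOp could presumably be passed to aggregateByKey after mapping each row to a ((id, req_id), value) pair that carries only the three string columns. Whether that is actually faster depends on the data and the cluster, so it would need to be timed against the groupBy version.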
Also, if, instead of an empty array, I could in exclude_list