How to replace column contents using Spark

Time: 2016-12-28 04:08:56

Tags: scala apache-spark spark-dataframe

I have a product information file with more than a million records. The CSV file looks like this:

    Product    CategoryName    SalesUnit    Other Columns...
    p1         a12             41
    p2         x5              72
    p3         x5              69
    p4         c21             80
    p5         b16             59
    p6         x5              75
    ..         ..              ..

I also have a mapping file (CategoryCode <-> CategoryName), shown below. The mapping file has about 200 records:

    CategoryCode    CategoryName
    1.0             a12
    2.0             b13
    3.0             b16
    4.0             c12
    5.0             c21
    6.0             x5
    ..              ..

In the end, I want to replace the CategoryName values with the corresponding CategoryCode:

    Product    Category    SalesUnit    Other Columns...
    p1         1.0         41
    p2         6.0         72
    p3         6.0         69
    p4         5.0         80
    p5         3.0         59
    p6         6.0         75
    ..         ..          ..

My approach is to use a udf on the Spark DataFrame:

    udf { (CategoryName: String) =>
        if (CategoryName.trim() == "a12") 1.0
        else if (CategoryName.trim() == "b13") 2.0
        else if (CategoryName.trim() == "b16") 3.0
        else if (CategoryName.trim() == "c12") 4.0
        else if (CategoryName.trim() == "c21") 5.0
        else if (CategoryName.trim() == "x5") 6.0
        else if (CategoryName.trim() == "z12") 7.0
        else if (...) ...
        ... ...
        else 999.0
    }
    
Is there a more elegant way to do the replacement than coding so many if ... else clauses? Thanks.

2 answers:

Answer 0: (score: 4)

Join the CSV, with its category column trimmed, to the mapping file, then select only the fields you need.
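
A minimal sketch of that idea, assuming Spark 2.x and using small in-memory DataFrames as stand-ins for the two CSV files. The object name `TrimmedJoin` and the helper `withCategoryCode` are made up for illustration, and the `broadcast` hint is an optional extra that seems reasonable because the mapping has only about 200 rows:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, trim}

object TrimmedJoin {
  // Hypothetical helper: trim the join key, join against the small
  // mapping table, then keep only the fields that are needed.
  def withCategoryCode(products: DataFrame, mapping: DataFrame): DataFrame =
    products
      .withColumn("CategoryName", trim(col("CategoryName")))
      // broadcast() is only a hint that the mapping side is small
      .join(broadcast(mapping), Seq("CategoryName"))
      .select(col("Product"), col("CategoryCode").as("Category"), col("SalesUnit"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("trimmed-join").getOrCreate()
    import spark.implicits._

    // Stand-ins for the CSV files; values come from the question,
    // with stray whitespace added to show why the trim matters.
    val products = Seq(("p1", " a12 ", 41), ("p2", "x5", 72), ("p4", "c21 ", 80))
      .toDF("Product", "CategoryName", "SalesUnit")
    val mapping = Seq((1.0, "a12"), (5.0, "c21"), (6.0, "x5"))
      .toDF("CategoryCode", "CategoryName")

    withCategoryCode(products, mapping).show()
    spark.stop()
  }
}
```

With real files, the two DataFrames would presumably come from something like `spark.read.option("header", "true").csv(...)` instead of `toDF`.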

Answer 1: (score: 3)

You can join the two DataFrames on CategoryName and then drop CategoryName itself, since you no longer need it afterwards.

You can do it like this:

    scala> // There can be more columns; only these are used for the demonstration

    scala> val df1=sc.parallelize(Seq(("p1","a12",41),("p2","x5",72),("p3","x5",69))).toDF("Product","CategoryName","SalesUnit")
    df1: org.apache.spark.sql.DataFrame = [Product: string, CategoryName: string ... 1 more field]

    scala> // Category code DataFrame

    scala> val df2=sc.parallelize(Seq((1.0,"a12"),(4.0,"c12"),(5.0,"c21"),(6.0,"x5"))).toDF("CategoryCode","CategoryName")
    df2: org.apache.spark.sql.DataFrame = [CategoryCode: double, CategoryName: string]

    scala> val resultDF=df1.join(df2,"CategoryName").withColumnRenamed("CategoryCode","Category").drop("CategoryName")
    resultDF: org.apache.spark.sql.DataFrame = [Product: string, SalesUnit: int ... 1 more field]

    scala> resultDF.show()
    +-------+---------+--------+
    |Product|SalesUnit|Category|
    +-------+---------+--------+
    |     p1|       41|     1.0|
    |     p2|       72|     6.0|
    |     p3|       69|     6.0|
    +-------+---------+--------+
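
One thing to keep in mind: the join above is an inner join, so a product whose CategoryName has no entry in the mapping disappears from the result, whereas the udf in the question maps such rows to 999.0. If that fallback matters, a left outer join combined with `coalesce` is one possible way to keep it. The following is only a sketch under that assumption; `LeftJoinWithDefault`, `withDefaultCode`, and the unmapped sample category `q9` are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}

object LeftJoinWithDefault {
  // Hypothetical helper: keep every product row and fall back to a
  // default code when the mapping has no matching CategoryName.
  def withDefaultCode(products: DataFrame, mapping: DataFrame,
                      default: Double = 999.0): DataFrame =
    products
      .join(mapping, Seq("CategoryName"), "left_outer")
      // CategoryCode is null for unmatched rows; coalesce picks the default
      .withColumn("Category", coalesce(col("CategoryCode"), lit(default)))
      .drop("CategoryName", "CategoryCode")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("left-join-default").getOrCreate()
    import spark.implicits._

    // "q9" is an invented category that is absent from the mapping
    val df1 = Seq(("p1", "a12", 41), ("p2", "x5", 72), ("p7", "q9", 10))
      .toDF("Product", "CategoryName", "SalesUnit")
    val df2 = Seq((1.0, "a12"), (6.0, "x5"))
      .toDF("CategoryCode", "CategoryName")

    withDefaultCode(df1, df2).show()
    spark.stop()
  }
}
```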

P.S.: This is just a small demonstration.
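
If you would rather stay with a udf, as in the question, the if ... else chain can probably be replaced by a lookup in a Scala `Map`, since roughly 200 entries fit comfortably in memory. This is a sketch rather than a recommendation over the join; the names `MapLookupUdf` and `lookup` are invented here, and the map is written out by hand only to keep the example short:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object MapLookupUdf {
  // Hypothetical pure helper: trim the name, look it up, and fall back
  // to the default for unknown or null names.
  def lookup(codes: Map[String, Double], default: Double = 999.0)(name: String): Double =
    Option(name).map(_.trim).flatMap(codes.get).getOrElse(default)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("map-lookup-udf").getOrCreate()
    import spark.implicits._

    // In practice this map could be built from the mapping file, e.g. by
    // collecting the small mapping DataFrame to the driver.
    val codes = Map("a12" -> 1.0, "b13" -> 2.0, "b16" -> 3.0,
                    "c12" -> 4.0, "c21" -> 5.0, "x5" -> 6.0)
    val toCode = udf((name: String) => lookup(codes)(name))

    val df = Seq(("p1", " a12", 41), ("p5", "b16", 59), ("p6", "x5", 75))
      .toDF("Product", "CategoryName", "SalesUnit")

    df.withColumn("Category", toCode(col("CategoryName")))
      .drop("CategoryName")
      .show()
    spark.stop()
  }
}
```

The join keeps the mapping as data that Spark can plan around, while the udf is opaque to the optimizer, so the join is likely the better default.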