I have a product information file with more than a million records. The CSV file looks like this:

Product  CategoryName  SalesUnit  Other Columns...
p1       a12           41
p2       x5            72
p3       x5            69
p4       c21           80
p5       b16           59
p6       x5            75
..       ..            ..

I also have a mapping file (CategoryCode <-> CategoryName), shown below. The mapping file has about 200 records:
CategoryCode  CategoryName
1.0           a12
2.0           b13
3.0           b16
4.0           c12
5.0           c21
6.0           x5
..            ..

Finally, I want to replace the CategoryName values with the corresponding CategoryCode:
Product  Category  SalesUnit  Other Columns...
p1       1.0       41
p2       6.0       72
p3       6.0       69
p4       5.0       80
p5       3.0       59
p6       6.0       75
..       ..        ..

My approach is to use a UDF on the Spark DataFrame:
udf { (CategoryName: String) =>
  if (CategoryName.trim() == "a12") 1.0
  else if (CategoryName.trim() == "b13") 2.0
  else if (CategoryName.trim() == "b16") 3.0
  else if (CategoryName.trim() == "c12") 4.0
  else if (CategoryName.trim() == "c21") 5.0
  else if (CategoryName.trim() == "x5") 6.0
  else if (CategoryName.trim() == "z12") 7.0
  else if (...) ...
  ...
  else 999.0
}

Is there any more elegant way to achieve the replacement, rather than coding so many if...else clauses? Thanks.
Answer 0 (score: 4)
Join the mapping file with the CSV on the trimmed category, then select only the fields you need.
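A minimal sketch of what this answer describes, assuming the two files have already been loaded into DataFrames named products and mapping (those names, and a SparkSession called spark, are assumptions, not from the post):

import org.apache.spark.sql.functions.trim
import spark.implicits._ // already in scope inside spark-shell

// Normalize the join key on the large table, join against the
// ~200-row mapping table, and keep only the columns needed downstream.
val result = products
  .withColumn("CategoryName", trim($"CategoryName"))
  .join(mapping, Seq("CategoryName"))
  .select($"Product", $"CategoryCode".as("Category"), $"SalesUnit")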
Answer 1 (score: 3)
You can join the DataFrames on CategoryName and then drop CategoryName itself, since you won't need it afterwards.

You can do it like this:
scala> // Could have more columns; only these are used here to demonstrate
scala> val df1=sc.parallelize(Seq(("p1","a12",41),("p2","x5",72),("p3","x5",69))).toDF("Product","CategoryName","SalesUnit")
df1: org.apache.spark.sql.DataFrame = [Product: string, CategoryName: string ... 1 more field]
scala> // Category-code DataFrame
scala> val df2=sc.parallelize(Seq((1.0,"a12"),(4.0,"c12"),(5.0,"c21"),(6.0,"x5"))).toDF("CategoryCode","CategoryName")
df2: org.apache.spark.sql.DataFrame = [CategoryCode: double, CategoryName: string]
scala> val resultDF=df1.join(df2,"CategoryName").withColumnRenamed("CategoryCode","Category").drop("CategoryName")
resultDF: org.apache.spark.sql.DataFrame = [Product: string, SalesUnit: int ... 1 more field]
scala> resultDF.show()
+-------+---------+--------+
|Product|SalesUnit|Category|
+-------+---------+--------+
| p1| 41| 1.0|
| p2| 72| 6.0|
| p3| 69| 6.0|
+-------+---------+--------+
P.S.: This is just a small demonstration.
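Two points worth noting when scaling this demo up to the real data (a hedged sketch building on the df1/df2 above, not part of either answer): with only ~200 rows, the mapping table is a good candidate for a broadcast join, which avoids shuffling the million-row side; and an inner join silently drops products whose category is missing from the mapping, whereas the original UDF fell back to 999.0, so a left join plus coalesce reproduces that default:

import org.apache.spark.sql.functions.{broadcast, coalesce, lit}

// Broadcast the small mapping table so only df1 stays partitioned,
// and keep unmatched categories, defaulting them to 999.0 like the
// original UDF's final else branch.
val resultDF = df1
  .join(broadcast(df2), Seq("CategoryName"), "left")
  .withColumn("Category", coalesce($"CategoryCode", lit(999.0)))
  .drop("CategoryName", "CategoryCode")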