I have a dataset that looks like this:
id paramgroup1
1 CURRENCY=USD~COUNTRY=USA~CUSTCATEGORY=REGULAR
2 CURRENCY=USD~COUNTRY=USA~CUSTCATEGORY=GUEST
3 CURRENCY=INR~COUNTRY=IND~CUSTCATEGORY=REGULAR
Now I want to add a count column that counts the parameters separated by the delimiter (~), so the final dataset after the Spark transformation would be:
id paramgroup1 count
1 CURRENCY=USD~COUNTRY=USA~CUSTCATEGORY=REGULAR 3
2 CURRENCY=USD~COUNTRY=USA~CUSTCATEGORY=GUEST 3
3 CURRENCY=INR~COUNTRY=IND 2
Any help would be appreciated.
Answer 0 (score: 0)
// In Scala:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the $"colName" column syntax

// Split paramgroup1 on the "~" delimiter and count the resulting tokens
val df1 = df.withColumn("count", size(split($"paramgroup1", "~")))
df1.show(false)
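
For reference, here is a minimal self-contained sketch of the same approach. The SparkSession setup, the app name, and the variable names (spark, df, withCount) are illustrative assumptions, not part of the original answer; the sample rows simply mirror the question's data.

// Self-contained sketch, assuming Spark with the DataFrame API available
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("ParamCount").master("local[*]").getOrCreate()
import spark.implicits._

// Recreate the sample data from the question
val df = Seq(
  (1, "CURRENCY=USD~COUNTRY=USA~CUSTCATEGORY=REGULAR"),
  (2, "CURRENCY=USD~COUNTRY=USA~CUSTCATEGORY=GUEST"),
  (3, "CURRENCY=INR~COUNTRY=IND~CUSTCATEGORY=REGULAR")
).toDF("id", "paramgroup1")

// split() yields an array of the "~"-separated tokens; size() counts the array elements
val withCount = df.withColumn("count", size(split($"paramgroup1", "~")))
withCount.show(false)

Note that split() interprets its second argument as a Java regular expression; since "~" has no special regex meaning, it can be passed as a literal here.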