I have a Dataset like the following:

monthYear code
201601    11
201601    12
201601    12
201601    10
201602    null
201602    21
201602    21
201602    21
201603    null

When code is null, I want to replace it with the code that occurred most often in the previous month. For the example above, the first null would be replaced with 12 and the second with 21. The result would be:

monthYear code
201601    11
201601    12
201601    12
201601    10
201602    12
201602    21
201602    21
201602    21
201603    21
How can I do this?
Answer 0 (score: 0)
You can replace null or NaN values in a DataFrame with the fill transformation of the DataFrameNaFunctions class: https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/DataFrameNaFunctions.html#fill(double)
Example:
Likewise, if you want to replace the values of this column:
scala> val df = spark.read.json("../test.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show
+----+----+
| age|name|
+----+----+
|  12| xyz|
|null| abc|
+----+----+

scala> df.na.fill(0, Seq("age")).show
+---+----+
|age|name|
+---+----+
| 12| xyz|
|  0| abc|
+---+----+
But again, fill cannot substitute a different value per row; you would have to compute the replacement on the source side.
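To illustrate the "compute it on the source side" idea, here is a minimal sketch (not from the answer; the variable names and the mode computation are my own, and it deliberately ignores the per-previous-month detail of the question, showing only the fill-with-a-precomputed-value pattern):

```scala
import org.apache.spark.sql.functions._

// Find the most frequent non-null code in the whole frame...
val top = df.filter(col("code").isNotNull)
  .groupBy("code").count()
  .orderBy(desc("count"))
  .first()
  .getAs[Long]("code")

// ...then fill(value, cols) can use that computed value for the nulls.
val filled = df.na.fill(top, Seq("code"))
```

The point is that fill itself only ever takes one constant per column, so any row-dependent logic has to happen before the fill.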
Answer 1 (score: 0)
You need a window function to find the most frequent value, and coalesce to fill it in. Let's assume df is a DataFrame (declared as a var) holding the table you showed:
// count how often each code appears within its month
df = df.selectExpr("*", "count(code) over (partition by monthYear, code) as code_count")
// take the month's most frequent code (rank() takes no argument, so the
// original rank(code) would not parse; first_value picks the code itself)
df = df.selectExpr("*", "first_value(code) over (partition by monthYear order by code_count desc) as max_code")
// use it wherever code is null
df = df.selectExpr("*", "coalesce(code, max_code) as code_new")
should get you close to what you want, though as written it uses the current month's most frequent code; the question asks for the previous month's, so an extra shift by one month is still needed.
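For the previous-month requirement specifically, here is an untested sketch along the same lines (names like topCode and prev_top are my own, not from the answer): compute each month's most frequent code, shift it by one month with lag, and join back.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 1. how often each non-null code appears per month
val counts = df.filter(col("code").isNotNull)
  .groupBy("monthYear", "code").count()

// 2. the most frequent code of each month
val byMonth = Window.partitionBy("monthYear").orderBy(desc("count"))
val topCode = counts
  .withColumn("rn", row_number().over(byMonth))
  .filter(col("rn") === 1)
  .select(col("monthYear"), col("code").as("top_code"))

// 3. shift by one month: months whose codes are all null (like 201603)
//    have no row in topCode, so start from the distinct months of df and
//    left-join before applying lag
val months = df.select("monthYear").distinct()
val prevTop = months
  .join(topCode, Seq("monthYear"), "left")
  .withColumn("prev_top", lag("top_code", 1).over(Window.orderBy("monthYear")))
  .select("monthYear", "prev_top")

// 4. fill nulls from the previous month's top code
val result = df.join(prevTop, Seq("monthYear"), "left")
  .withColumn("code", coalesce(col("code"), col("prev_top")))
  .drop("prev_top")
```

Note the unpartitioned Window.orderBy in step 3 moves all months into a single partition; that is fine here because there is only one row per month, but it would not scale to a huge key space.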