Given the following table:
+--+------------------+-----------+
|id| diagnosis_age| diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 1| 2.80033330216659| 315.320000|
| 1| 2.8222365762732| 315.320000|
| 1| 5.64822705794013| 325.320000|
| 1| 5.686557787521759| 335.320000|
| 2| 5.70572315231258| 315.320000|
| 2| 5.724888517103389| 315.320000|
| 3| 5.744053881894209| 315.320000|
| 3|5.7604813374292005| 315.320000|
| 3| 5.77993740687426| 315.320000|
+--+------------------+-----------+
I am trying to reduce the records for each id down to a single one, keeping the most frequent diagnosis for that id.
With an RDD I could do something like this:
rdd.map(lambda x: (x["id"], [(x["diagnosis_age"], x["diagnosis"])]))\
.reduceByKey(lambda x, y: x + y)\
.map(lambda x: [i[1] for i in x[1]])\
.map(lambda x: [max(zip((x.count(i) for i in set(x)), set(x)))])
In SQL:
select id, diagnosis, diagnosis_age
from (select id, diagnosis, diagnosis_age, count(*) as cnt,
row_number() over (partition by id order by count(*) desc) as seqnum
from t
group by id, diagnosis, diagnosis_age
) da
where seqnum = 1;
Desired output:
+--+------------------+-----------+
|id| diagnosis_age| diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 2| 5.70572315231258| 315.320000|
| 3| 5.744053881894209| 315.320000|
+--+------------------+-----------+
How can I achieve the same using only Spark DataFrame operations, if at all possible? Specifically without any RDD actions or SQL.
Thanks
Answer 0 (score: 1)
Python: this is a conversion of my Scala code below.
from pyspark.sql.functions import col, first, count, row_number
from pyspark.sql import Window

# Count how often each (id, diagnosis) pair occurs (keeping one diagnosis_age per pair),
# then rank the pairs within each id by that count and keep the most frequent one.
df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).alias("diagnosis_age"), count(col("diagnosis_age")).alias("cnt")) \
  .withColumn("seqnum", row_number().over(Window.partitionBy("id").orderBy(col("cnt").desc()))) \
  .where("seqnum = 1") \
  .select("id", "diagnosis_age", "diagnosis", "cnt") \
  .orderBy("id") \
  .show(10, False)
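The snippet above assumes the question's table is already available as df. A minimal sketch for building that sample DataFrame (column types assumed: long, double, double):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows copied from the question's table; diagnosis is assumed to be a double.
rows = [
    (1, 2.1843037179180302, 315.32), (1, 2.80033330216659, 315.32),
    (1, 2.8222365762732, 315.32), (1, 5.64822705794013, 325.32),
    (1, 5.686557787521759, 335.32), (2, 5.70572315231258, 315.32),
    (2, 5.724888517103389, 315.32), (3, 5.744053881894209, 315.32),
    (3, 5.7604813374292005, 315.32), (3, 5.77993740687426, 315.32),
]
df = spark.createDataFrame(rows, ["id", "diagnosis_age", "diagnosis"])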
Scala: Your query does not make sense to me: the groupBy condition means the count for every record is always 1 (see the quick check after the result below). I have modified a few things in the DataFrame expression, for example:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, first, row_number}

// Count occurrences per (id, diagnosis) pair (keeping one diagnosis_age per pair),
// then rank the pairs within each id by that count and keep the most frequent one.
df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).as("diagnosis_age"), count(col("diagnosis_age")).as("cnt"))
  .withColumn("seqnum", row_number().over(Window.partitionBy("id").orderBy(col("cnt").desc)))
  .where("seqnum = 1")
  .select("id", "diagnosis_age", "diagnosis", "cnt")
  .orderBy("id")
  .show(false)
The result is:
+---+------------------+---------+---+
|id |diagnosis_age |diagnosis|cnt|
+---+------------------+---------+---+
|1 |2.1843037179180302|315.32 |3 |
|2 |5.70572315231258 |315.32 |2 |
|3 |5.744053881894209 |315.32 |3 |
+---+------------------+---------+---+
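As a quick illustration of the point about the original groupBy (a hypothetical check, not part of the answer): every (id, diagnosis, diagnosis_age) combination in the sample data is unique, so grouping on all three columns always produces a count of 1 and cannot identify a most frequent diagnosis.

# Hypothetical check: grouping on all three columns makes each group a single row,
# so the count is 1 everywhere and ordering by it is meaningless.
df.groupBy("id", "diagnosis", "diagnosis_age").count().show()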
Answer 1 (score: 1)
You can use count, max and first with window functions and then filter on count = max.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Total count of each (id, diagnosis) pair, and the per-id maximum of those counts.
w = Window().partitionBy("id", "diagnosis")
w2 = Window().partitionBy("id")

df.withColumn("count", F.count("diagnosis").over(w))\
  .withColumn("max", F.max("count").over(w2))\
  .filter("count=max")\
  .groupBy("id").agg(F.min("diagnosis_age").alias("diagnosis_age"),  # earliest age of the kept diagnosis
                     F.first("diagnosis").alias("diagnosis"))\
  .orderBy("id").show()
+---+------------------+---------+
| id| diagnosis_age|diagnosis|
+---+------------------+---------+
| 1|2.1843037179180302| 315.32|
| 2| 5.70572315231258| 315.32|
| 3| 5.744053881894209| 315.32|
+---+------------------+---------+
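A possible usage sketch (reusing df, w and w2 from above; not part of the original answer): bind the same pipeline to a variable instead of calling .show(), then collect the one-row-per-id result on the driver.

# Sketch: same pipeline as above, bound to a variable so it can be collected.
result = (
    df.withColumn("count", F.count("diagnosis").over(w))
      .withColumn("max", F.max("count").over(w2))
      .filter("count=max")
      .groupBy("id")
      .agg(F.min("diagnosis_age").alias("diagnosis_age"),
           F.first("diagnosis").alias("diagnosis"))
      .orderBy("id")
)

# e.g. {1: (2.1843037179180302, 315.32), 2: (5.70572315231258, 315.32), ...}
most_frequent = {r["id"]: (r["diagnosis_age"], r["diagnosis"]) for r in result.collect()}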