I have a PySpark data frame as shown below.
df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 1| N|
| 2| Y|
| 3| N|
+---+----+
I want to remove the rows where the id is duplicated and test is N, so that each id keeps only one row (the Y row when both exist). Now, when I query new_df:
new_df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 2| Y|
| 3| N|
+---+----+
I am not able to figure out how to do this. I have done a groupBy on id with a count, but that only gives me the id column and the count. What I have done is below:
grouped_df = new_df.groupBy("id").count()
How can I achieve the result I want?
EDIT
I have a data frame like the one below.
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| N|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+
When I do
new_df = df.groupBy("sn").agg(max("attribute").alias("attribute"))
I get a "str has no attribute alias" error.
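A likely cause, judging from the error text: the built-in Python max is being applied instead of pyspark.sql.functions.max, so it returns a plain str that has no .alias method. A minimal sketch of the fix under that assumption (only the sn and attribute columns are kept here):
# Assumption: import Spark's max under another name so it does not clash with the built-in.
from pyspark.sql.functions import max as spark_max
new_df = df.groupBy("sn").agg(spark_max("attribute").alias("attribute"))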
The expected result should look like this:
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+
Answer 0 (score: 4)
Not the most general solution, but it should work well here:
from pyspark.sql.functions import max
df = spark.createDataFrame(
[(1, "Y"), (1, "N"), (2, "Y"), (3, "N")], ("id", "test")
)
df.groupBy("id").agg(max("test").alias("test")).show()
# +---+----+
# | id|test|
# +---+----+
# | 1| Y|
# | 3| N|
# | 2| Y|
# +---+----+
A more general one:
from pyspark.sql.functions import col, count, when
test = when(count(when(col("test") == "Y", "Y")) > 0, "Y").otherwise("N")
df.groupBy("id").agg(test.alias("test")).show()
# +---+----+
# | id|test|
# +---+----+
# | 1| Y|
# | 3| N|
# | 2| Y|
# +---+----+
This can be generalized to handle more classes and non-trivial orderings. For example, if you had three classes Y, ?, N, evaluated in that order, you could use:
(when(count(when(col("test") == "Y", True)) > 0, "Y")
.when(count(when(col("test") == "?", True)) > 0, "?")
.otherwise("N"))
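As a usage sketch (the name test3 below is just an illustrative binding, not from the original answer), this expression plugs into agg exactly like the two-class version:
from pyspark.sql.functions import col, count, when
# Illustrative name for the three-class expression above.
test3 = (when(count(when(col("test") == "Y", True)) > 0, "Y")
         .when(count(when(col("test") == "?", True)) > 0, "?")
         .otherwise("N"))
df.groupBy("id").agg(test3.alias("test")).show()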
If there are other columns you have to preserve with these methods, you will need the approach shown in Find maximum row per group in Spark DataFrame.
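For the edited question, where the device column also has to be kept, a window-function sketch along the lines of that link might look like this (column names taken from the edited data frame; not part of the original answer):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
# One row per sn, preferring attribute = "Y" ("Y" sorts after "N", so order descending).
w = Window.partitionBy("sn").orderBy(col("attribute").desc())
(df.withColumn("rn", row_number().over(w))
   .filter(col("rn") == 1)
   .drop("rn")
   .show())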
Answer 1 (score: 3)
Another option using row_number:
df.selectExpr(
'*',
'row_number() over (partition by id order by test desc) as rn'
).filter('rn=1 or test="Y"').drop('rn').show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 3| N|
| 2| Y|
+---+----+
This method does not do any aggregation; it just removes the duplicated-id rows whose test is N.
Answer 2 (score: 0)
Using a Spark SQL temporary table (I used a Databricks notebook):
case class T(id:Int,test:String)
val df=spark.createDataset(Seq(T(1, "Y"), T(1, "N"), T(2, "Y"), T(3, "N")))
df.createOrReplaceTempView("df")
%sql select id, max(test) from df group by id
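Outside a Databricks notebook the same idea works from plain PySpark; a minimal sketch, assuming df is the PySpark data frame from the question:
# Register the data frame as a temp view and run the same aggregation via Spark SQL.
df.createOrReplaceTempView("df")
spark.sql("select id, max(test) as test from df group by id").show()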
Answer 3 (score: 0)
You can use the following code:
# register as temp table (createOrReplaceTempView is the newer equivalent)
df.registerTempTable("df")
# pick a single row per id, preferring test = "Y"; sqlc is the SQLContext in use
newDF = sqlc.sql("""
WITH dfCte AS
(
    SELECT *, row_number() OVER (PARTITION BY id ORDER BY test DESC) AS RowNumber
    FROM df
)
SELECT * FROM dfCte WHERE RowNumber = 1
""")
# drop the row number column and show the new df
newDF.drop('RowNumber').show()