My dataframe looks like this:
+------+-------+-----+----+----+
|Title |Status |Suite|ID  |Time|
+------+-------+-----+----+----+
|KIM   |Passed |ABC  |123 |20  |
|KJT   |Passed |ABC  |123 |10  |
|ZXD   |Passed |CDF  |123 |15  |
|XCV   |Passed |GHY  |113 |36  |
|KJM   |Passed |RTH  |456 |45  |
|KIM   |Passed |ABC  |115 |47  |
|JY    |Passed |JHJK |8963|74  |
|KJH   |Passed |SNMP |256 |47  |
|KJH   |Passed |ABC  |123 |78  |
|LOK   |Passed |GHY  |456 |96  |
|LIM   |Passed |RTH  |113 |78  |
|MKN   |Passed |ABC  |115 |74  |
|KJM   |Passed |GHY  |8963|74  |
+------+-------+-----+----+----+
It can be created with:

df = sqlCtx.createDataFrame(
[
('KIM', 'Passed', 'ABC', '123',20),
('KJT', 'Passed', 'ABC', '123',10),
('ZXD', 'Passed', 'CDF', '123',15),
('XCV', 'Passed', 'GHY', '113',36),
('KJM', 'Passed', 'RTH', '456',45),
('KIM', 'Passed', 'ABC', '115',47),
('JY', 'Passed', 'JHJK', '8963',74),
('KJH', 'Passed', 'SNMP', '256',47),
('KJH', 'Passed', 'ABC', '123',78),
('LOK', 'Passed', 'GHY', '456',96),
('LIM', 'Passed', 'RTH', '113',78),
('MKN', 'Passed', 'ABC', '115',74),
('KJM', 'Passed', 'GHY', '8963',74),
],('Title', 'Status', 'Suite', 'ID','Time')
)
I need to group by ID, aggregate on Time, and in the result I need Title, Status, and Suite along with the ID.
My output should be:
+------+-------+-----+----+-----+
|Title |Status |Suite|ID  |Time |
+------+-------+-----+----+-----+
|KIM   |Passed |ABC  |123 |30.75|
|XCV   |Passed |GHY  |113 |57   |
|KJM   |Passed |RTH  |456 |70.5 |
|KIM   |Passed |ABC  |115 |60.5 |
|JY    |Passed |JHJK |8963|74   |
|KJH   |Passed |SNMP |256 |47   |
+------+-------+-----+----+-----+
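Here Time is the mean per ID: for ID 123, for example, the values 20, 10, 15, and 78 average to (20 + 10 + 15 + 78) / 4 = 30.75, and for ID 113 the values 36 and 78 average to 57.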
I have tried the following code, but the result only contains the ID and the aggregated Time:
from pyspark.sql.functions import mean

df.groupBy("ID").agg(mean("Time").alias("Time"))
Answer 0 (score: 2)
With the modified expected output you can use first to get an arbitrary value for each of the remaining columns:
from pyspark.sql.functions import avg, first

df.groupBy("id").agg(
first("Title"), first("Status"), first("Suite"), avg("Time")
).toDF("id", "Title", "Status", "Suite", "Time").show()
# +----+-----+------+-----+-----+
# | id|Title|Status|Suite| Time|
# +----+-----+------+-----+-----+
# | 113| XCV|Passed| GHY| 57.0|
# | 256| KJH|Passed| SNMP| 47.0|
# | 456| KJM|Passed| RTH| 70.5|
# | 115| KIM|Passed| ABC| 60.5|
# |8963| JY|Passed| JHJK| 74.0|
# | 123| KIM|Passed| ABC|30.75|
# +----+-----+------+-----+-----+
Original answer

It looks like you want drop_duplicates:
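A minimal sketch of that approach (my own addition, not part of the original answer), assuming the goal is simply one row per ID without averaging Time:

# keep one arbitrary row per ID; note Time is NOT averaged here,
# which is why the edited answer above switches to groupBy + agg
df.drop_duplicates(subset=["ID"]).show()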
If you want to keep a specific row instead, see Find maximum row per group in Spark DataFrame.
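For illustration, a hedged sketch of the window-function approach that link describes, e.g. keeping the row with the largest Time per ID (column names taken from the question; this is not part of the original answer):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# rank rows within each ID by Time (descending) and keep only the top row
w = Window.partitionBy("ID").orderBy(col("Time").desc())
df.withColumn("rn", row_number().over(w)) \
  .filter(col("rn") == 1) \
  .drop("rn") \
  .show()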