Here is a small sample of the data I'm working with:
+-------------------+----------------+---------+----------+----------+
| date| home_team|away_team|home_score|away_score|
+-------------------+----------------+---------+----------+----------+
|1872-11-30 00:00:00| Scotland| England| 0| 0|
|1873-03-08 00:00:00| England| Scotland| 4| 2|
|1874-03-07 00:00:00| Scotland| England| 2| 1|
|1875-03-06 00:00:00| England| Scotland| 2| 2|
|1876-03-04 00:00:00| Scotland| England| 3| 0|
|1876-03-25 00:00:00| Scotland| Wales| 4| 0|
|1877-03-03 00:00:00| England| Scotland| 1| 3|
|1877-03-05 00:00:00| Wales| Scotland| 0| 2|
|1878-03-02 00:00:00| Scotland| England| 7| 2|
|1878-03-23 00:00:00| Scotland| Wales| 9| 0|
|1879-01-18 00:00:00| England| Wales| 2| 1|
|1879-04-05 00:00:00| England| Scotland| 5| 4|
|1879-04-07 00:00:00| Wales| Scotland| 0| 3|
|1880-03-13 00:00:00| Scotland| England| 5| 4|
|1880-03-15 00:00:00| Wales| England| 2| 3|
I want to compute the total number of matches each team has played. To do this, I tried to create a column all_teams that should contain every entry from both home_team and away_team.
I tried:
new_df = old_df.withColumn("all_teams", old_df.home_team) \
.withColumn("all_teams", old_df.away_team)
This query runs without errors, but it does not give me the correct output. How can I achieve this?
Note: I am using PySpark v2.3.
Answer (score: 1):
Use F.array() together with F.explode(). Your attempt doesn't work because the second withColumn("all_teams", ...) call simply overwrites the first, so the column ends up holding only the away_team values:
from pyspark.sql import functions as F

# array() packs both team columns into a single array column;
# explode() then emits one output row per array element,
# so each match produces one row for each of the two teams.
old_df.withColumn('all_teams', F.explode(F.array('home_team', 'away_team'))) \
      .groupby('all_teams') \
      .count() \
      .show()
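The explode(array(...)) step simply turns each match row into two rows, one per team, after which the count per team is an ordinary group-by count. The same logic can be verified in plain Python (no Spark needed) on the first six rows of the sample data above:

```python
from collections import Counter

# (home_team, away_team) pairs taken from the first rows of the sample table
matches = [
    ("Scotland", "England"),
    ("England", "Scotland"),
    ("Scotland", "England"),
    ("England", "Scotland"),
    ("Scotland", "England"),
    ("Scotland", "Wales"),
]

# Each match counts once for the home team and once for the away team,
# mirroring what explode(array('home_team', 'away_team')) does in Spark.
counts = Counter()
for home, away in matches:
    counts[home] += 1
    counts[away] += 1

print(counts)  # Counter({'Scotland': 6, 'England': 5, 'Wales': 1})
```

On these six rows Scotland appears in every match, so its count is 6, matching what the Spark query would report for the same subset.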