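In my DataFrame I have an "away_lineup" column that contains groups of 5 strings, and a "play_length" column that holds a duration value for each row. I know np.unique can detect unique string values and np.sum adds up the values in a column, but how can I use something like np.unique to detect each unique string and sum the "play_length" values of the rows in which that string occurs?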
away_lineup play_length
0 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:05
1 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:10
2 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:20
3 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:07
4 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons 0:00:25
5 Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick 0:00:14
My desired output would be
player play_length
Dario Saric 0:01:21
Robert Covington 0:01:21
Joel Embiid 0:01:21
Markelle Fultz 0:01:21
Ben Simmons 0:01:07
JJ Redick 0:00:14
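where the unique names are extracted from "away_lineup" and stored in a new column "player", and the "play_length" values are summed over the rows in which that player appears.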
Answer 0 (score: 0)
Use pandas.DataFrame.explode and pandas.to_timedelta:
Note: pandas.DataFrame.explode is available in pandas >= 0.25
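import pandas as pd
# Turn each lineup string into a list of player names
df['away_lineup'] = df['away_lineup'].str.split(', ')
# Parse the durations so they can be summed
df['play_length'] = pd.to_timedelta(df['play_length'])
# One row per (player, play), then total play_length per player
new_df = df.explode('away_lineup').groupby('away_lineup').sum()
print(new_df)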
Output:
play_length
away_lineup
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21
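For a version that can be run as-is, here is a minimal sketch of the same explode approach, assuming the sample rows from the question. The names `lineup_a`, `lineup_b`, and `per_player` are introduced here for illustration only. It stores the names in a separate `player` column and passes `sort=False` so that players are listed in order of first appearance, as in the desired output.

```python
import pandas as pd

# Sample data copied from the question: five rows with one lineup, one row with another.
lineup_a = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons"
lineup_b = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick"
df = pd.DataFrame({
    "away_lineup": [lineup_a] * 5 + [lineup_b],
    "play_length": ["0:00:05", "0:00:10", "0:00:20", "0:00:07", "0:00:25", "0:00:14"],
})

# Durations must be timedeltas before they can be summed.
df["play_length"] = pd.to_timedelta(df["play_length"])

per_player = (
    df.assign(player=df["away_lineup"].str.split(", "))  # list of names per row
      .explode("player")                                 # one row per (player, play)
      .groupby("player", sort=False)["play_length"]      # keep first-appearance order
      .sum()
      .reset_index()                                     # 'player' becomes a column
)
print(per_player)
```

Selecting `["play_length"]` before `.sum()` keeps the untouched `away_lineup` string column out of the aggregation.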
Answer 1 (score: 0)
If your pandas version does not support explode:
df['play_length'] = pd.to_timedelta(df['play_length'])
# Split the lineup into one column per player, next to play_length
new_df = pd.concat((df[['play_length']],
                    df['away_lineup'].str.split(r',\s*', expand=True)),
                   axis=1)
# Melt the player columns into rows, then total play_length per name
(new_df.melt(id_vars=['play_length'],
             value_vars=new_df.columns[1:],
             value_name='artist')
       .groupby('artist').play_length.sum()
)
Output:
artist
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21
Name: play_length, dtype: timedelta64[ns]
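Another way to get the same totals without explode is the split/stack idiom sketched below. It is an alternative to the melt version, not a restatement of it. The sample data is copied from the question, and the names `players` and `totals` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
lineup_a = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons"
lineup_b = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick"
df = pd.DataFrame({
    "away_lineup": [lineup_a] * 5 + [lineup_b],
    "play_length": ["0:00:05", "0:00:10", "0:00:20", "0:00:07", "0:00:25", "0:00:14"],
})
df["play_length"] = pd.to_timedelta(df["play_length"])

# expand=True gives one column per player; stack() turns those columns into
# rows indexed by (original row, position within the lineup).
players = df["away_lineup"].str.split(", ", expand=True).stack()

# Drop the position level so the index is the original row label again.
players = players.reset_index(level=1, drop=True).rename("player")

# Joining on the index repeats each play_length once per player in that row.
totals = df[["play_length"]].join(players).groupby("player")["play_length"].sum()
print(totals)
```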
Answer 2 (score: 0)
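Try the get_dummies trick. Pass ', ' (comma plus space) as the separator; a bare ',' leaves a leading space on every name except the first one in each lineup:
# play_length must already be a timedelta; if it is still a string, run:
# df['play_length'] = pd.to_timedelta(df['play_length'])
df.away_lineup.str.get_dummies(', ').mul(df.play_length, axis=0).sum()
Output:
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21
dtype: timedelta64[ns]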
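To see why this works, it can help to look at the indicator matrix itself. The sketch below assumes the sample rows from the question and converts the durations to whole seconds before multiplying, which is a choice made here to keep the arithmetic on plain integers. The names `loose`, `tight`, `seconds`, and `totals` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
lineup_a = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons"
lineup_b = "Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick"
lineups = pd.Series([lineup_a] * 5 + [lineup_b])
play_length = pd.to_timedelta(
    pd.Series(["0:00:05", "0:00:10", "0:00:20", "0:00:07", "0:00:25", "0:00:14"])
)

# The separator matters: a bare ',' keeps the space that follows each comma,
# so every name except the first in a lineup gets a leading space.
loose = lineups.str.get_dummies(",")
tight = lineups.str.get_dummies(", ")
print(list(loose.columns))
print(list(tight.columns))

# Each column of `tight` is 1 where the player is in that row's lineup, else 0.
# Weighting the rows by duration and summing down gives the total per player.
seconds = play_length.dt.total_seconds().astype(int)
totals = pd.to_timedelta(tight.mul(seconds, axis=0).sum(), unit="s")
print(totals)
```

Because the columns are sorted, names with a leading space come before names without one, which is a quick way to spot the separator problem in the result.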
Answer 3 (score: 0)
You can use explode and groupby like this:
import pandas as pd
## create dummy data
arr = [("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:05"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:10"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:20"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:07"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, Ben Simmons", "00:00:25"),
("Dario Saric, Robert Covington, Joel Embiid, Markelle Fultz, JJ Redick", "00:00:14"),]
df = pd.DataFrame(arr, columns=["Player", "Play Time"])
df["Play Time"] = pd.to_timedelta(df["Play Time"])
## Solution
# Split on ", " (comma plus space) so the names carry no leading whitespace
df["Player"] = df["Player"].str.split(", ")
df.explode("Player").groupby("Player").sum()
Output:
Play Time
Player
Ben Simmons 00:01:07
Dario Saric 00:01:21
JJ Redick 00:00:14
Joel Embiid 00:01:21
Markelle Fultz 00:01:21
Robert Covington 00:01:21