df.head()
Player Tourn Score
Tom a 65
Henry a 72
Johno a 69
Ingram a 79
Ben a 76
Harry a 66
Nick b 70
Ingram b 79
Johno b 69
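I have a DataFrame of player scores across various tournaments ('a' through 'm'). Some players have played in several tournaments, others in only one. For each player I want to create an additional column that is 1 if that player took part in the tournament on that row and 0 otherwise (essentially a dummy variable).
It would look like this (repeated for every player):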
Player Tourn Score Tom(Dummy)
Tom a 65 1
Henry a 72 1
Johno a 69 1
Ingram a 79 1
Ben a 76 1
Harry a 66 1
Nick b 70 0
Ingram b 79 0
Johno b 69 0
What is the best way to do this in code? (Ideally I need something that scales well to large DataFrames!)
Interested to hear your replies.
Answer 0 (score: 4)
First use get_dummies, then groupby on the Tourn column with transform and any, cast to int, and finally join the result back to the original:
df1 = pd.get_dummies(df['Player'])
df2 = df.join(df1.groupby(df['Tourn']).transform('any').astype(int))
Another, faster solution (it only yields 0/1 values if each player appears at most once per tournament):
df.join(df.groupby(['Tourn','Player']).size().unstack(fill_value=0), on='Tourn')
print (df2)
Player Tourn Score Ben Harry Henry Ingram Johno Nick Tom
0 Tom a 65 1 1 1 1 1 0 1
1 Henry a 72 1 1 1 1 1 0 1
2 Johno a 69 1 1 1 1 1 0 1
3 Ingram a 79 1 1 1 1 1 0 1
4 Ben a 76 1 1 1 1 1 0 1
5 Harry a 66 1 1 1 1 1 0 1
6 Nick b 70 0 0 0 1 1 1 0
7 Ingram b 79 0 0 0 1 1 1 0
8 Johno b 69 0 0 0 1 1 1 0
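For anyone who wants to reproduce the output above, here is a minimal self-contained sketch of both approaches. The frame is rebuilt by hand from the sample in the question, and the variable names (`dummies`, `via_dummies`, `counts`, `via_counts`) are only illustrative. The `clip(upper=1)` call is an addition that is not in the answer: it is one way to keep the second approach at 0/1 if a player shows up more than once in the same tournament.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Player': ['Tom', 'Henry', 'Johno', 'Ingram', 'Ben', 'Harry', 'Nick', 'Ingram', 'Johno'],
    'Tourn':  ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'Score':  [65, 72, 69, 79, 76, 66, 70, 79, 69],
})

# Approach 1: one indicator column per player, then ask per tournament
# "is this flag set on any row?" and broadcast the answer to every row.
dummies = pd.get_dummies(df['Player'])
via_dummies = df.join(dummies.groupby(df['Tourn']).transform('any').astype(int))

# Approach 2: count rows per (Tourn, Player), pivot players into columns,
# then clip so a repeated (Tourn, Player) pair still gives 1 (added here,
# not part of the original answer).
counts = df.groupby(['Tourn', 'Player']).size().unstack(fill_value=0)
via_counts = df.join(counts.clip(upper=1), on='Tourn')

print(via_dummies)
```

On this sample both frames carry the same player columns, so either can be used; the second avoids building a row-by-player indicator matrix first.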
Timings:
import numpy as np
import pandas as pd
N = 10000
a = ['Tom', 'Henry', 'Johno', 'Ingram', 'Ben', 'Harry', 'Nick', 'Ingram', 'Johno']
a = ['{}{}'.format(i, j) for i in range(5) for j in a]
df = pd.DataFrame({'Player':np.random.choice(a, size=N),
'Tourn':np.random.randint(1000, size=N).astype(str)})
df = df.sort_values('Tourn')
#print (df.head())
In [486]: %%timeit
...: df.join(df.groupby(['Tourn','Player']).size().unstack(fill_value=0), on='Tourn')
...:
100 loops, best of 3: 12.6 ms per loop
In [487]: %%timeit
...: df.join(pd.crosstab(df.Tourn, df.Player), on='Tourn')
10 loops, best of 3: 60.9 ms per loop
In [488]: %%timeit
...: df1 = pd.get_dummies(df['Player'])
...: df2 = df.join(df1.groupby(df['Tourn']).transform('any').astype(int))
...:
10 loops, best of 3: 120 ms per loop
In [489]: %%timeit
...: df.join(pd.get_dummies(df.Tourn).T.dot(pd.get_dummies(df.Player)), on='Tourn')
...:
1 loop, best of 3: 895 ms per loop
In [490]: %%timeit
...: dd = df.Tourn.str.get_dummies()
...: df.assign(**{x.Player: dd[x.Tourn] for x in df.itertuples()})
...:
1 loop, best of 3: 7.02 s per loop
In [491]: %%timeit
...: df.assign(**{x.Player:df.Tourn.eq(x.Tourn).astype(int) for x in df.itertuples()})
...:
1 loop, best of 3: 13.7 s per loop
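Caveat
These results do not show how performance changes with the number of groups and the length of the DataFrame, both of which will affect the timings of some of these solutions.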
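To see how the group count affects things on your own machine, a rough harness along these lines may help. Everything here is an assumption for illustration: `make_frame`, `unstack_join`, and `crosstab_join` are hypothetical helper names, and the sizes are arbitrary. No timings are quoted because they depend on hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, n_tourn, n_players=45, seed=0):
    """Hypothetical helper: random Player/Tourn pairs of a chosen shape."""
    rng = np.random.default_rng(seed)
    players = ['p{}'.format(i) for i in range(n_players)]
    return pd.DataFrame({
        'Player': rng.choice(players, size=n_rows),
        'Tourn': rng.integers(n_tourn, size=n_rows).astype(str),
    })


def unstack_join(df):
    # groupby/size/unstack approach from the answer above.
    return df.join(df.groupby(['Tourn', 'Player']).size().unstack(fill_value=0), on='Tourn')


def crosstab_join(df):
    # crosstab approach from the timings above.
    return df.join(pd.crosstab(df.Tourn, df.Player), on='Tourn')


timings = {}
for n_tourn in (10, 1000):
    frame = make_frame(n_rows=5000, n_tourn=n_tourn)
    for func in (unstack_join, crosstab_join):
        # Best of 3 single runs, in seconds.
        best = min(timeit.repeat(lambda: func(frame), number=1, repeat=3))
        timings[(func.__name__, n_tourn)] = best

for key, seconds in timings.items():
    print(key, '{:.1f} ms'.format(seconds * 1000))
```

Varying `n_rows` and `n_players` in the same loop gives a fuller picture than a single data shape.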
Answer 1 (score: 2)
pd.get_dummies, pd.DataFrame.dot, and pd.DataFrame.join
I use dot to perform a cross-tabulation. I arranged it so that the Tourn values end up in the index, which lets me use join on that column.
df.join(pd.get_dummies(df.Tourn).T.dot(pd.get_dummies(df.Player)), on='Tourn')
Player Tourn Score Ben Harry Henry Ingram Johno Nick Tom
0 Tom a 65 1 1 1 1 1 0 1
1 Henry a 72 1 1 1 1 1 0 1
2 Johno a 69 1 1 1 1 1 0 1
3 Ingram a 79 1 1 1 1 1 0 1
4 Ben a 76 1 1 1 1 1 0 1
5 Harry a 66 1 1 1 1 1 0 1
6 Nick b 70 0 0 0 1 1 1 0
7 Ingram b 79 0 0 0 1 1 1 0
8 Johno b 69 0 0 0 1 1 1 0
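To unpack why the dot product acts as a cross-tabulation, here is a small sketch on the sample data from the question. The names `T`, `P`, and `table` are illustrative. Passing `dtype=int` is an assumption added here: recent pandas versions return boolean dummies by default, and integer indicators keep the product a count.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Player': ['Tom', 'Henry', 'Johno', 'Ingram', 'Ben', 'Harry', 'Nick', 'Ingram', 'Johno'],
    'Tourn':  ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'Score':  [65, 72, 69, 79, 76, 66, 70, 79, 69],
})

# One row per observation; columns are tournaments (T) or players (P).
T = pd.get_dummies(df['Tourn'], dtype=int)
P = pd.get_dummies(df['Player'], dtype=int)

# Entry (t, p) of T.T.dot(P) sums T[row, t] * P[row, p] over all rows,
# i.e. the number of rows where tournament t and player p occur together.
table = T.T.dot(P)

# Tournaments are now the index of `table`, so it can be joined on 'Tourn'.
result = df.join(table, on='Tourn')
print(table)
```

Because the entries are counts, a player listed twice in one tournament would produce a 2, the same as with `pd.crosstab`.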
Answer 2 (score: 1)
You could do
Option 1 - derived from piRSquared's dot
In [990]: df.join(pd.crosstab(df.Tourn, df.Player), on='Tourn')
Out[990]:
Player Tourn Score Ben Harry Henry Ingram Johno Nick Tom
0 Tom a 65 1 1 1 1 1 0 1
1 Henry a 72 1 1 1 1 1 0 1
2 Johno a 69 1 1 1 1 1 0 1
3 Ingram a 79 1 1 1 1 1 0 1
4 Ben a 76 1 1 1 1 1 0 1
5 Harry a 66 1 1 1 1 1 0 1
6 Nick b 70 0 0 0 1 1 1 0
7 Ingram b 79 0 0 0 1 1 1 0
8 Johno b 69 0 0 0 1 1 1 0
Option 2
In [976]: df.assign(**{x.Player:df.Tourn.eq(x.Tourn).astype(int) for x in df.itertuples()})
Out[976]:
Player Tourn Score Ben Harry Henry Ingram Johno Nick Tom
0 Tom a 65 1 1 1 0 0 0 1
1 Henry a 72 1 1 1 0 0 0 1
2 Johno a 69 1 1 1 0 0 0 1
3 Ingram a 79 1 1 1 0 0 0 1
4 Ben a 76 1 1 1 0 0 0 1
5 Harry a 66 1 1 1 0 0 0 1
6 Nick b 70 0 0 0 1 1 1 0
7 Ingram b 79 0 0 0 1 1 1 0
8 Johno b 69 0 0 0 1 1 1 0
Option 3
In [979]: dd = df.Tourn.str.get_dummies()
In [980]: df.assign(**{x.Player: dd[x.Tourn] for x in df.itertuples()})
Out[980]:
Player Tourn Score Ben Harry Henry Ingram Johno Nick Tom
0 Tom a 65 1 1 1 0 0 0 1
1 Henry a 72 1 1 1 0 0 0 1
2 Johno a 69 1 1 1 0 0 0 1
3 Ingram a 79 1 1 1 0 0 0 1
4 Ben a 76 1 1 1 0 0 0 1
5 Harry a 66 1 1 1 0 0 0 1
6 Nick b 70 0 0 0 1 1 1 0
7 Ingram b 79 0 0 0 1 1 1 0
8 Johno b 69 0 0 0 1 1 1 0
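Note that Options 2 and 3 print 0 for Ingram and Johno in tournament 'a', while Option 1 prints 1. A likely explanation is that the dict comprehension keeps only one entry per player name, so a later row replaces an earlier one. The sketch below, on the question's sample data, illustrates this; `last_tourn`, `loop_based`, and `table_based` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Player': ['Tom', 'Henry', 'Johno', 'Ingram', 'Ben', 'Harry', 'Nick', 'Ingram', 'Johno'],
    'Tourn':  ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'Score':  [65, 72, 69, 79, 76, 66, 70, 79, 69],
})

# Dict keys are unique: Ingram and Johno each appear twice, and the later
# ('b') row replaces the earlier ('a') one.
last_tourn = {row.Player: row.Tourn for row in df.itertuples()}
print(last_tourn['Ingram'])

# Option 2 as written: each player column reflects a single tournament.
loop_based = df.assign(**{row.Player: df.Tourn.eq(row.Tourn).astype(int)
                          for row in df.itertuples()})

# Option 1 as written: each player column reflects every tournament played.
table_based = df.join(pd.crosstab(df.Tourn, df.Player), on='Tourn')

print(loop_based['Ingram'].tolist())
print(table_based['Ingram'].tolist())
```

If players can enter more than one tournament, as in the question, Option 1 is the one that matches the requested output.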
Answer 3 (score: 0)
I ran into a similar problem and found this to be the best solution. Thanks to https://www.ritchieng.com/pandas-creating-dummy-variables/
In your case, the answer would be:
df['Tom(Dummy)'] = df.Tourn.map({'b':0, 'a':1})
Read as:
# using .map to create dummy variables
# df['category_name or new Dummy var. name '] = df.Category.map({'unique_term':0, 'unique_term2':1})
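The hard-coded mapping only covers tournaments 'a' and 'b'; with `.map`, any tournament missing from the dict becomes NaN. A possible variation that derives the tournaments from the data instead is sketched below. The `player` and `played` names are hypothetical, and this handles one player at a time rather than all players at once.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Player': ['Tom', 'Henry', 'Johno', 'Ingram', 'Ben', 'Harry', 'Nick', 'Ingram', 'Johno'],
    'Tourn':  ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'Score':  [65, 72, 69, 79, 76, 66, 70, 79, 69],
})

player = 'Tom'

# Tournaments in which the chosen player has at least one row.
played = df.loc[df['Player'].eq(player), 'Tourn'].unique()

# 1 where the row's tournament is one the player entered, 0 otherwise.
df[player + '(Dummy)'] = df['Tourn'].isin(played).astype(int)

print(df)
```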
Hope this helps!