My problem is similar to this question, but just different enough that the same solution doesn't work...
I have two dataframes, df1 and df2, as shown below:
import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a':np.random.choice(names, 20), 'ID_b':np.random.choice(names,20)})
df2 = pd.DataFrame({'ID':names})
>>> df1
ID_a ID_b
0 joe ben
1 ben jack
2 jane joe
3 ben jill
4 ben beatrice
5 jill ben
6 jane joe
7 jane jack
8 jane jack
9 ben jane
10 joe jane
11 jane jill
12 beatrice joe
13 ben joe
14 jill beatrice
15 joe beatrice
16 beatrice beatrice
17 beatrice jane
18 jill joe
19 joe joe
>>> df2
ID
0 jack
1 jill
2 jane
3 joe
4 ben
5 beatrice
What I want to do is add a column to df2 holding the number of rows in df1 where the given name appears in either column ID_a or ID_b, giving this result:
>>> df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
This loop gets me what I need, but it is inefficient for large dataframes. I would be very grateful if someone could suggest a better alternative:
df2['count'] = 0
for idx,row in df2.iterrows():
    df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
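Thanks in advance!
Answer 0 (score: 8)
The "either" part complicates things, but it should still be doable.
Option 1
Since other users decided to turn this into a speed race, here's mine: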
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
df2['count'] = df2['ID'].map(Counter(c))
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
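To see why Option 1 wraps each row in set(), here is a minimal sketch on a few made-up rows (the `rows` list is hypothetical, not the question's data). A row such as ('joe', 'joe') would otherwise contribute two hits for the same name.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy rows (not the question's data): (ID_a, ID_b) pairs.
rows = [("joe", "ben"), ("ben", "jack"), ("joe", "joe"), ("jill", "ben")]

# Without set(): "joe" in the third row is counted twice.
naive = Counter(chain.from_iterable(rows))

# With set(): each name is counted at most once per row.
per_row = Counter(chain.from_iterable(set(r) for r in rows))

print(naive["joe"], per_row["joe"])  # 3 2
```

The second counter is the "number of rows containing the name" that the question asks for.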
Option 2
(Original answer) stack-based
c = df1.stack().groupby(level=0).value_counts().count(level=1)
Alternatively,
c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()
Alternatively,
v = df1.stack()
c = v.groupby([v.index.get_level_values(0), v]).count().count(level=1)
# c = v.groupby([v.index.get_level_values(0), v]).nunique().count(level=1)
Then,
df2['count'] = df2.ID.map(c)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
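A note for newer pandas: as far as I know, the level argument of Series.count was deprecated in pandas 1.3 and removed in 2.0, so the count(level=1) variants may fail there. The sketch below keeps the stack idea but deduplicates (row, name) pairs explicitly. The toy frame and the column names 'row' and 'name' are my own assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame (not the question's data).
df1 = pd.DataFrame({"ID_a": ["joe", "ben", "joe", "jill"],
                    "ID_b": ["ben", "jack", "joe", "ben"]})
df2 = pd.DataFrame({"ID": ["jack", "jill", "joe", "ben"]})

stacked = df1.stack()  # MultiIndex (row label, column name) -> name

# One (row, name) pair per row that contains the name at least once.
pairs = pd.DataFrame({
    "row": stacked.index.get_level_values(0),
    "name": stacked.to_numpy(),
}).drop_duplicates()

c = pairs["name"].value_counts()
df2["count"] = df2["ID"].map(c)
print(df2)
```

This avoids any level= keyword, so it should not depend on which side of the 2.0 change you are on.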
Option 3
repeat-based reshape and count
v = pd.DataFrame({
'i' : df1.values.reshape(-1, ),
'j' : df1.index.repeat(2)
})
c = v.loc[~v.duplicated(), 'i'].value_counts()
df2['count'] = df2.ID.map(c)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
Option 4
concat + mask
v = pd.concat(
[df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()
df2['count'] = df2.ID.map(v)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
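The mask step in Option 4 can be looked at in isolation. This is a small sketch on a made-up frame (not the question's data): where both columns hold the same name, the ID_b entry becomes NaN, and value_counts skips NaN by default, so that row is counted only once.

```python
import pandas as pd

# Hypothetical toy frame (not the question's data).
df1 = pd.DataFrame({"ID_a": ["joe", "ben", "joe", "jill"],
                    "ID_b": ["ben", "jack", "joe", "ben"]})

same = df1["ID_a"] == df1["ID_b"]   # True only for the ('joe', 'joe') row
masked_b = df1["ID_b"].mask(same)   # NaN where both columns hold the same name

# NaN entries are dropped by value_counts, so duplicates within a row vanish.
v = pd.concat([df1["ID_a"], masked_b]).value_counts()
print(v)
```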
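Answer 1 (score: 5)
Below are a couple of approaches based on numpy arrays, with benchmarks further down.
Important: take these results with a grain of salt. Remember that performance depends on your data, environment and hardware. When choosing, you should also consider readability and adaptability.
Categorical data: the superior performance of categorical data in jp2 (i.e. factorising strings to integers via an internal dictionary-like structure) is data-dependent, but where it works it should be applicable to all of the algorithms below, with good performance and memory benefits.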
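To make the categorical point concrete, here is a rough sketch of what the conversion does to storage. The seed and series length are arbitrary choices of mine, and the exact byte counts will vary with your pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
s = pd.Series(rng.choice(names, 10_000))
cat = s.astype('category')

# Each value is stored as a small integer code into the categories array.
print(cat.cat.categories.tolist())
print(cat.cat.codes.dtype)

# Compare memory footprints (deep=True also counts the string payloads).
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))

# Comparing against a label gives the same boolean mask either way.
same_mask = ((s == 'joe').to_numpy() == (cat == 'joe').to_numpy()).all()
print(same_mask)
```

Whether this translates into a speed-up depends on how few distinct values there are relative to the number of rows.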
import pandas as pd
import numpy as np
from itertools import chain
from collections import Counter
# Tested on python 3.6.2 / pandas 0.20.3 / numpy 1.13.1
%timeit original(df1, df2) # 48.4 ms per loop
%timeit jp1(df1, df2) # 5.82 ms per loop
%timeit jp2(df1, df2) # 2.20 ms per loop
%timeit brad(df1, df2) # 7.83 ms per loop
%timeit cs1(df1, df2) # 12.5 ms per loop
%timeit cs2(df1, df2) # 17.4 ms per loop
%timeit cs3(df1, df2) # 15.7 ms per loop
%timeit cs4(df1, df2) # 10.7 ms per loop
%timeit wen1(df1, df2) # 19.7 ms per loop
%timeit wen2(df1, df2) # 32.8 ms per loop
def original(df1, df2):
    # The loop from the question, row by row over df2
    for idx,row in df2.iterrows():
        df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
    return df2
def jp1(df1, df2):
    # Same loop, but comparing against the underlying numpy arrays
    for idx, item in enumerate(df2['ID']):
        df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
    return df2
def jp2(df1, df2):
    # As jp1, after converting the string columns to categoricals (in place)
    df2['ID'] = df2['ID'].astype('category')
    df1['ID_a'] = df1['ID_a'].astype('category')
    df1['ID_b'] = df1['ID_b'].astype('category')
    for idx, item in enumerate(df2['ID']):
        df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
    return df2
def brad(df1, df2):
    # NumPy broadcasting: compare every df2 ID against every df1 row
    names1, names2 = df1.values.T
    v2 = df2.ID.values
    mask1 = v2 == names1[:, None]
    mask2 = v2 == names2[:, None]
    df2['count'] = np.logical_or(mask1, mask2).sum(axis=0)
    return df2
def cs1(df1, df2):
    # Counter over the set of names in each row
    c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
    df2['count'] = df2['ID'].map(Counter(c))
    return df2
def cs2(df1, df2):
    # stack-based
    v = df1.stack().groupby(level=0).value_counts().count(level=1)
    df2['count'] = df2.ID.map(v)
    return df2
def cs3(df1, df2):
    # repeat-based reshape, then drop duplicate (name, row) pairs
    v = pd.DataFrame({
        'i' : df1.values.reshape(-1, ),
        'j' : df1.index.repeat(2)
    })
    c = v.loc[~v.duplicated(), 'i'].value_counts()
    df2['count'] = df2.ID.map(c)
    return df2
def cs4(df1, df2):
    # concat + mask
    v = pd.concat(
        [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
    ).value_counts()
    df2['count'] = df2.ID.map(v)
    return df2
def wen1(df1, df2):
    return pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]
def wen2(df1, df2):
    return pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]
Setup
import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a':np.random.choice(names, 10000), 'ID_b':np.random.choice(names, 10000)})
df2 = pd.DataFrame({'ID':names})
df2['count'] = 0
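Before comparing timings it may be worth checking that a candidate agrees with a slow but obvious reference. Note also that several of the benchmarked functions modify their arguments (jp2, for instance, converts the df1 columns to category in place), so passing copies is one way to keep runs independent. The helpers `reference_counts` and `concat_mask_counts` below are hypothetical names of mine, and the 1000-row size is just to keep the check quick.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a': np.random.choice(names, 1000),
                    'ID_b': np.random.choice(names, 1000)})
df2 = pd.DataFrame({'ID': names})

def reference_counts(df1, ids):
    """Slow but straightforward: rows where either column equals the id."""
    return [int(((df1['ID_a'] == i) | (df1['ID_b'] == i)).sum()) for i in ids]

def concat_mask_counts(df1, ids):
    """concat + mask approach, then one lookup per id (0 if absent)."""
    v = pd.concat(
        [df1['ID_a'], df1['ID_b'].mask(df1['ID_a'] == df1['ID_b'])]
    ).value_counts()
    return [int(v.get(i, 0)) for i in ids]

# Work on copies so one candidate cannot alter the input of the next.
expected = reference_counts(df1.copy(), df2['ID'])
got = concat_mask_counts(df1.copy(), df2['ID'])
print(got == expected)
```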
Answer 2 (score: 3)
Using get_dummies
pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]
Out[614]:
jack 3
jill 5
jane 8
joe 9
ben 7
beatrice 6
dtype: int64
I think this should be fast...
from itertools import chain
from collections import Counter
pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]
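As far as I know, the level argument of DataFrame.sum was deprecated in pandas 1.3 and removed in 2.0, so the get_dummies one-liner may not run on recent versions. Here is a sketch of the same idea without it, on a made-up frame (not the question's data): build one indicator frame per column, align both to the wanted IDs, and OR them.

```python
import pandas as pd

# Hypothetical toy frame (not the question's data).
df1 = pd.DataFrame({"ID_a": ["joe", "ben", "joe", "jill"],
                    "ID_b": ["ben", "jack", "joe", "ben"]})
ids = ["jack", "jill", "joe", "ben"]

# Indicator frames, aligned to the same columns; names absent from a
# column get an all-zero indicator via fill_value.
a = pd.get_dummies(df1["ID_a"]).reindex(columns=ids, fill_value=0).astype(bool)
b = pd.get_dummies(df1["ID_b"]).reindex(columns=ids, fill_value=0).astype(bool)

# A row counts for a name if the name appears in either column.
counts = (a | b).sum()
print(counts)
```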
Answer 3 (score: 3)
Here is a solution where you effectively perform a nested "in" loop by expanding the dimensionality of ID from df2, to take advantage of NumPy broadcasting:
>>> def count_names(df1, df2):
...     names1, names2 = df1.values.T
...     v2 = df2.ID.values[:, None]
...     mask1 = v2 == names1
...     mask2 = v2 == names2
...     df2['count'] = np.logical_or(mask1, mask2).sum(axis=1)
...     return df2
>>> %timeit -r 5 -n 1000 count_names(df1, df2)
144 µs ± 10.4 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
>>> %timeit -r 5 -n 1000 jp(df1, df2)
224 µs ± 15.5 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
>>> %timeit -r 5 -n 1000 cs(df1, df2)
238 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -r 5 -n 1000 wen(df1, df2)
921 µs ± 15.3 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
The masks have shape (len(df2), len(df1)).
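The broadcasting step is easier to follow with the shapes printed out. This is a minimal sketch on made-up arrays (not the question's data), with 4 IDs and 5 rows so the two dimensions are easy to tell apart.

```python
import numpy as np

# Hypothetical toy data: 4 IDs to look up, 5 rows of (ID_a, ID_b).
ids = np.array(["jack", "jill", "joe", "ben"])
names1 = np.array(["joe", "ben", "joe", "jill", "jack"])   # ID_a column
names2 = np.array(["ben", "jack", "joe", "ben", "joe"])    # ID_b column

v2 = ids[:, None]                        # shape (4, 1)
mask = (v2 == names1) | (v2 == names2)   # broadcast to shape (4, 5)
counts = mask.sum(axis=1)                # one count per ID

print(v2.shape, mask.shape)
print(counts)
```

Since the mask holds len(df2) * len(df1) booleans, memory use grows with the product of the two lengths, which is worth keeping in mind if df2 is also large.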