A vectorized way to count occurrences of a string in either of two columns

Asked: 2018-03-21 17:57:19

Tags: python string pandas numpy dataframe

The problem I'm running into is similar to this question, but it differs too much to be solved with the same approach...

I have two dataframes, df1 and df2, as follows:

import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a':np.random.choice(names, 20), 'ID_b':np.random.choice(names,20)})    
df2 = pd.DataFrame({'ID':names})

>>> df1
        ID_a      ID_b
0        joe       ben
1        ben      jack
2       jane       joe
3        ben      jill
4        ben  beatrice
5       jill       ben
6       jane       joe
7       jane      jack
8       jane      jack
9        ben      jane
10       joe      jane
11      jane      jill
12  beatrice       joe
13       ben       joe
14      jill  beatrice
15       joe  beatrice
16  beatrice  beatrice
17  beatrice      jane
18      jill       joe
19       joe       joe

>>> df2
         ID
0      jack
1      jill
2      jane
3       joe
4       ben
5  beatrice

What I'd like to do is add a column to df2 that counts the rows of df1 in which the given name can be found in either ID_a or ID_b, so that the result looks like this:

>>> df2
         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

This loop gets me what I need, but it is inefficient for large dataframes. I'd be very grateful if someone could suggest a better approach:

df2['count'] = 0

for idx,row in df2.iterrows():
    df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])

Thanks in advance!

4 answers:

Answer 0 (score: 8)

The "either" part complicates things, but it should still be doable.
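
To see why a plain two-column value count is not enough, here is a quick illustration on the data above (a sketch; naive is just an illustrative name):

naive = pd.concat([df1.ID_a, df1.ID_b]).value_counts()
naive['joe']
# 10 -- but 'joe' appears in only 9 distinct rows; row 19 (joe, joe) is counted twice.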

Option 1
Since other users have decided to turn this into a speed contest, here is mine:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
df2['count'] = df2['ID'].map(Counter(c))
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6

176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
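
For readability, the same idea broken into steps (illustrative only, not a faster variant):

rows = df1.values.tolist()                  # [['joe', 'ben'], ['ben', 'jack'], ...]
per_row = (set(x) for x in rows)            # per-row sets, so ID_a == ID_b counts once
c = Counter(chain.from_iterable(per_row))   # name -> number of rows containing it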

Option 2
(original answer) Based on stack

c = df1.stack().groupby(level=0).value_counts().count(level=1)

Or,

c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()

Or,

v = df1.stack()
c = v.groupby([v.index.get_level_values(0), v]).count().count(level=1)
# c = v.groupby([v.index.get_level_values(0), v]).nunique().count(level=1)

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
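
What the stack-based variants rely on (illustrative): stack() turns the two columns into one long Series indexed by (row, column), so deduplicating per row and then tallying per name counts rows rather than cells.

long = df1.stack()
long.head(4)
# 0  ID_a     joe
#    ID_b     ben
# 1  ID_a     ben
#    ID_b    jack
# dtype: object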

Option 3
Reshape with repeat and count

v = pd.DataFrame({
        'i' : df1.values.reshape(-1, ), 
        'j' : df1.index.repeat(2)
    })
c = v.loc[~v.duplicated(), 'i'].value_counts()

df2['count'] = df2.ID.map(c)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
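
A note on why the duplicated() filter is there (illustrative): v pairs each name with the row index it came from, so v.duplicated() is True only when the same name occurs twice in the same row (i.e. ID_a == ID_b), and dropping those rows keeps one contribution per (name, row).

v[v.duplicated()]   # here: the two rows of df1 where ID_a == ID_b (rows 16 and 19)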

Option 4
concat + mask

v = pd.concat(
    [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()

df2['count'] = df2.ID.map(v)
df2

         ID  count
0      jack      3
1      jill      5
2      jane      8
3       joe      9
4       ben      7
5  beatrice      6
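
Here mask() replaces ID_b with NaN wherever it equals ID_a, and value_counts() skips NaN, so rows such as 16 and 19 contribute only once (illustrative check):

df1.ID_b.mask(df1.ID_a == df1.ID_b).isna().sum()   # 2 rows with ID_a == ID_b in this example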

Answer 1 (score: 5)

Here are a couple of approaches based on numpy arrays. Benchmarking below.

Important: take these results with a grain of salt. Remember that performance depends on your data, environment, and hardware. You should also weigh readability and adaptability when choosing.

Categorical data: the superior performance of the categorical data in jp2 (i.e. factorising the strings to integers via an internal dictionary-like structure) is data dependent, but where it works it should be applicable to all of the algorithms below, with good memory benefits as well.
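
A minimal sketch of what the categorical conversion buys you (assuming the columns hold a small, repeated set of labels):

s = df1['ID_a'].astype('category')
s.cat.categories                  # the unique labels, stored once
s.cat.codes.head()                # small integer codes, one per row
df1['ID_a'].memory_usage(deep=True), s.memory_usage(deep=True)   # object dtype vs categorical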

import pandas as pd
import numpy as np
from itertools import chain
from collections import Counter

# Tested on python 3.6.2 / pandas 0.20.3 / numpy 1.13.1

%timeit original(df1, df2)   # 48.4 ms per loop
%timeit jp1(df1, df2)        # 5.82 ms per loop
%timeit jp2(df1, df2)        # 2.20 ms per loop
%timeit brad(df1, df2)       # 7.83 ms per loop
%timeit cs1(df1, df2)        # 12.5 ms per loop
%timeit cs2(df1, df2)        # 17.4 ms per loop
%timeit cs3(df1, df2)        # 15.7 ms per loop
%timeit cs4(df1, df2)        # 10.7 ms per loop
%timeit wen1(df1, df2)       # 19.7 ms per loop
%timeit wen2(df1, df2)       # 32.8 ms per loop

def original(df1, df2):
    for idx,row in df2.iterrows():
        df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
    return df2

def jp1(df1, df2):
    for idx, item in enumerate(df2['ID']):
        df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
    return df2

def jp2(df1, df2):
    df2['ID'] = df2['ID'].astype('category')
    df1['ID_a'] = df1['ID_a'].astype('category')
    df1['ID_b'] = df1['ID_b'].astype('category')
    for idx, item in enumerate(df2['ID']):
        df2.iat[idx, 1] = np.sum((df1.ID_a.values == item) | (df1.ID_b.values == item))
    return df2

def brad(df1, df2):
    names1, names2 = df1.values.T
    v2 = df2.ID.values
    mask1 = v2 == names1[:, None]
    mask2 = v2 == names2[:, None]
    df2['count'] = np.logical_or(mask1, mask2).sum(axis=0)
    return df2

def cs1(df1, df2):
    c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
    df2['count'] = df2['ID'].map(Counter(c))
    return df2

def cs2(df1, df2):
    v = df1.stack().groupby(level=0).value_counts().count(level=1)
    df2['count'] = df2.ID.map(v)
    return df2

def cs3(df1, df2):
    v = pd.DataFrame({
            'i' : df1.values.reshape(-1, ), 
            'j' : df1.index.repeat(2)
        })
    c = v.loc[~v.duplicated(), 'i'].value_counts()

    df2['count'] = df2.ID.map(c)
    return df2

def cs4(df1, df2):
    v = pd.concat(
        [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
    ).value_counts()

    df2['count'] = df2.ID.map(v)
    return df2

def wen1(df1, df2):
    return pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]

def wen2(df1, df2):
    return pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]

Setup

import pandas as pd
import numpy as np

np.random.seed(42)

names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']

df1 = pd.DataFrame({'ID_a':np.random.choice(names, 10000), 'ID_b':np.random.choice(names, 10000)})    

df2 = pd.DataFrame({'ID':names})

df2['count'] = 0

Answer 2 (score: 3)

Using get_dummies

pd.get_dummies(df1, prefix='', prefix_sep='').sum(level=0,axis=1).gt(0).sum().loc[df2.ID]
Out[614]: 
jack        3
jill        5
jane        8
joe         9
ben         7
beatrice    6
dtype: int64
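
A step-by-step reading of that one-liner (a sketch; the level= form of sum matches the older pandas used here and would need a column groupby in recent versions):

d = pd.get_dummies(df1, prefix='', prefix_sep='')  # one 0/1 column per (source column, name)
per_row = d.sum(level=0, axis=1)                   # merge the duplicate name columns row-wise
present = per_row.gt(0)                            # True if the name appears in either column
counts = present.sum().loc[df2.ID]                 # rows per name, ordered like df2.ID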

And I think this should be fast as well...

from itertools import chain
from collections import Counter

pd.Series(Counter(list(chain(*list(map(set,df1.values)))))).loc[df2.ID]

Answer 3 (score: 3)

Here is one solution where you effectively perform the nested "in" loops by expanding the dimensions of the IDs from df2 to take advantage of NumPy broadcasting. The masks have shape (len(df2), len(df1)), so summing along axis=1 gives, for each name, the number of rows of df1 that contain it:

>>> def count_names(df1, df2):
...     names1, names2 = df1.values.T
...     v2 = df2.ID.values[:, None]
...     mask1 = v2 == names1
...     mask2 = v2 == names2
...     df2['count'] = np.logical_or(mask1, mask2).sum(axis=1)
...     return df2

>>> %timeit -r 5 -n 1000 count_names(df1, df2)
144 µs ± 10.4 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)

>>> %timeit -r 5 -n 1000 jp(df1, df2)
224 µs ± 15.5 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)

>>> %timeit -r 5 -n 1000 cs(df1, df2)
238 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit -r 5 -n 1000 wen(df1, df2)
921 µs ± 15.3 µs per loop (mean ± std. dev. of 5 runs, 1000 loops each)
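
A quick sanity check on the small 20-row example from the question (a sketch; the copy() is only there to avoid mutating df2 in place):

out = count_names(df1, df2.copy())
out
#          ID  count
# 0      jack      3
# 1      jill      5
# 2      jane      8
# 3       joe      9
# 4       ben      7
# 5  beatrice      6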