I have a dataframe of strings like this:
ID_0 ID_1
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h
For each pair of strings, I can count how many rows contain either string in either column, as follows.
import pandas as pd
import itertools
df = pd.read_csv("test.csv", header=None, names=["ID_0", "ID_1"], usecols=[0, 1])
alphabet_1 = set(df['ID_0'])
alphabet_2 = set(df['ID_1'])
# This just makes a set of all the strings in the dataframe.
alphabet = alphabet_1 | alphabet_2
# This iterates over all pairs and counts how many rows have either string in either column.
for (x, y) in itertools.combinations(alphabet, 2):
    print(x, y, len(df.loc[df['ID_0'].isin([x, y]) | df['ID_1'].isin([x, y])]))
This gives:
a c 3
a b 3
a e 3
a d 5
a g 3
a i 5
a h 4
a k 3
a j 3
c b 2
c e 2
c d 4
[...]
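The problem is that my dataframe is very large and the alphabet has size 200, and this method makes a separate pass over the whole dataframe for every pair of letters.
Is it possible to get the same output with a single pass over the dataframe?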
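One possible single-pass approach, sketched here under assumptions rather than taken from the answers: count, in one loop over the rows, how many rows contain each string and how many rows contain each unordered pair of distinct strings, then combine the two with inclusion-exclusion. The helper name `pair_counts` and the two counters are hypothetical names introduced for illustration, and the sample frame is rebuilt inline from the ten rows shown above.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# The ten sample rows from the question.
df = pd.DataFrame({
    "ID_0": list("gacjdibdid"),
    "ID_1": list("khieihbdah"),
})


def pair_counts(df):
    """Hypothetical helper: for every pair of distinct strings (x, y),
    return the number of rows that contain x or y in either column."""
    rows_with = Counter()       # rows_with[x]: rows containing x
    rows_with_both = Counter()  # keyed by frozenset({x, y}), x != y

    # Single pass over the rows.
    for a, b in zip(df["ID_0"], df["ID_1"]):
        rows_with[a] += 1
        if a != b:
            rows_with[b] += 1
            rows_with_both[frozenset((a, b))] += 1

    # Inclusion-exclusion over the alphabet; this loop no longer touches the rows.
    result = {}
    for x, y in combinations(sorted(rows_with), 2):
        both = rows_with_both[frozenset((x, y))]
        result[x, y] = rows_with[x] + rows_with[y] - both
    return result


counts = pair_counts(df)
print(counts["a", "c"], counts["a", "d"], counts["a", "i"])  # 3 5 5
```

The row loop is plain Python, so for very large frames the counting step would likely need to be replaced by grouped or array-based counting; the combination step only depends on the alphabet size.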
Timings
I created some data:
import numpy as np
import pandas as pd
from string import ascii_lowercase
n = 10**4
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['ID_0', 'ID_1'])
# Testing Parfait's answer
def f(row):
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0']) |
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return ser
%timeit df.apply(f, axis=1)
1 loops, best of 3: 37.8 s per loop
I would like to be able to do this for n = 10**8. Can this be sped up?
Answer 0 (score: 1)
Consider using the DataFrame.apply() method:
from io import StringIO
import pandas as pd
data = '''ID_0,ID_1
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
'''
df = pd.read_csv(StringIO(data))
def f(row):
    # Count the rows that contain either of this row's two strings in either column.
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0']) |
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return ser
df['CountIDs'] = df.apply(f, axis=1)
print(df)
# ID_0 ID_1 CountIDs
# 0 g k 1
# 1 a h 4
# 2 c i 4
# 3 j e 1
# 4 d i 6
# 5 i h 6
# 6 b b 1
# 7 d d 3
# 8 i a 5
# 9 d h 5
Alternative solutions:
# VECTORIZED w/ list comprehension
def f(x, y, z):
    ser = [len(df[(df['ID_0'] == x[i]) | (df['ID_1'] == x[i]) |
                  (df['ID_0'] == y[i]) | (df['ID_1'] == y[i])]) for i in z]
    return ser
df['CountIDs'] = f(df['ID_0'], df['ID_1'], df.index)
# USING map()
def f(x, y):
    ser = len(df[(df['ID_0'] == x) | (df['ID_1'] == x) |
                 (df['ID_0'] == y) | (df['ID_1'] == y)])
    return ser
df['CountIDs'] = list(map(f, df['ID_0'], df['ID_1']))
# USING zip() w/ list comprehension
def f(x, y):
    ser = len(df[(df['ID_0'] == x) | (df['ID_1'] == x) |
                 (df['ID_0'] == y) | (df['ID_1'] == y)])
    return ser
df['CountIDs'] = [f(x, y) for x, y in zip(df['ID_0'], df['ID_1'])]
# USING apply() w/ isin()
def f(row):
    ser = len(df[(df['ID_0'].isin([row['ID_0'], row['ID_1']])) |
                 (df['ID_1'].isin([row['ID_0'], row['ID_1']]))])
    return ser
df['CountIDs'] = df.apply(f, axis=1)
Answer 1 (score: 1)
You can compute the row-level counts with some clever combinatorics/set theory:
# Count of individual characters and pairs.
# A row with the same character in both columns is counted once for that character.
char_count = pd.concat([df['ID_0'], df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
# Get the counts.
df['count'] = [char_count[x] if x == y
               else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
               for x, y in df[['ID_0', 'ID_1']].values]
The resulting output:
ID_0 ID_1 count
0 g k 1
1 a h 4
2 c i 4
3 j e 1
4 d i 6
5 i h 6
6 b b 1
7 d d 3
8 i a 5
9 d h 5
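I compared the output of my method against the row-level iteration approach on a dataset of 5000 rows, and all of the counts matched.
The formula is inclusion-exclusion: the number of rows containing x or y is |X| + |Y| - |X ∩ Y|, where X and Y are the sets of rows containing x and y. The cardinality for a single element is just its char_count. When the elements are distinct, the cardinality of the intersection is just the count of pairs of those elements in either order. Note that when the two elements are identical, the formula reduces to char_count.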
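As a small worked check of the inclusion-exclusion step behind the formula above, the following sketch evaluates each term with boolean masks for the pair (d, i) on the ten sample rows. The variable names are illustrative only and do not come from the answer.

```python
import pandas as pd

# The ten sample rows from the question.
df = pd.DataFrame({
    "ID_0": list("gacjdibdid"),
    "ID_1": list("khieihbdah"),
})

x, y = "d", "i"

# Boolean masks: does the row contain x (resp. y) in either column?
has_x = (df["ID_0"] == x) | (df["ID_1"] == x)
has_y = (df["ID_0"] == y) | (df["ID_1"] == y)

n_x = int(has_x.sum())                 # rows containing d
n_y = int(has_y.sum())                 # rows containing i
n_both = int((has_x & has_y).sum())    # rows containing both d and i
n_either = int((has_x | has_y).sum())  # rows containing d or i

# Inclusion-exclusion: |X or Y| = |X| + |Y| - |X and Y|
assert n_either == n_x + n_y - n_both
print(n_x, n_y, n_both, n_either)  # 3 4 1 6
```

The result, 6, agrees with the count shown for row 4 (d, i) in the output table.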
Timings
Using the timing setup from the question, and the following function for my answer:
def root(df):
    char_count = pd.concat([df['ID_0'], df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
    pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
    df['count'] = [char_count[x] if x == y
                   else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
                   for x, y in df[['ID_0', 'ID_1']].values]
    return df
For n=10**4 I get the following timings:
%timeit root(df.copy())
10 loops, best of 3: 25 ms per loop
%timeit df.apply(f, axis=1)
1 loop, best of 3: 49.4 s per loop
For n=10**6 I get the following timings:
%timeit root(df.copy())
10 loops best of 3: 2.22 s per loop
It looks like my solution scales roughly linearly.
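For larger n, the remaining Python-level list comprehension could in principle be replaced by array operations. The following is a sketch of that idea, not the answer's code and not benchmarked here: it encodes the strings as integers with `pd.factorize` and counts with `np.bincount`. The function name `root_numpy` is hypothetical, and the approach assumes the alphabet is small enough that an array of size k*k fits in memory (k = 200 gives 40,000 entries).

```python
import numpy as np
import pandas as pd


def root_numpy(df):
    """Hypothetical array-based variant of the row-level count."""
    # Encode every string as an integer code 0..k-1 shared by both columns.
    # ravel() is row-major, so codes alternate ID_0, ID_1, ID_0, ID_1, ...
    codes, uniques = pd.factorize(df[["ID_0", "ID_1"]].to_numpy().ravel())
    a, b = codes[0::2], codes[1::2]
    k = len(uniques)
    same = a == b

    # Rows containing each symbol; a row with the symbol in both
    # columns is counted once.
    char_count = np.bincount(a, minlength=k) + np.bincount(b[~same], minlength=k)

    # Unordered pair id, so (x, y) and (y, x) share a counter.
    pair_id = np.minimum(a, b) * k + np.maximum(a, b)
    pair_count = np.bincount(pair_id, minlength=k * k)

    # Inclusion-exclusion for distinct symbols, plain count otherwise.
    return np.where(same,
                    char_count[a],
                    char_count[a] + char_count[b] - pair_count[pair_id])


# The ten sample rows from the question.
df = pd.DataFrame({
    "ID_0": list("gacjdibdid"),
    "ID_1": list("khieihbdah"),
})
result = root_numpy(df)
print(result.tolist())  # [1, 4, 4, 1, 6, 6, 1, 3, 5, 5]
```

On the sample data this reproduces the count column shown in the answer's output.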