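Given a DataFrame, what is the best way to find the rows that partially match a given list of values?
Currently, I have the given values stored as rows in one DataFrame (df1). I iterate over those rows and, for each one, apply a function to every row of another DataFrame (df2) that counts how many values in that row match the conditions. I then return the subset of the second DataFrame whose count meets a threshold.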
def partialMatch(row, conditions):
    # Count how many of the four fields in `row` equal the target values.
    count = 0
    if row['ResidenceZip'] == conditions['ResidenceZip']:
        count += 1
    if row['FirstName'] == conditions['FirstName']:
        count += 1
    if row['LastName'] == conditions['LastName']:
        count += 1
    if row['Birthday'] == conditions['Birthday']:
        count += 1
    return count
concat_all = []
for i, row in df1.iterrows():
    c = {'ResidenceZip': row['ResidenceZip'], 'FirstName': row['FirstName'],
         'LastName': row['LastName'], 'Birthday': row['Birthday']}
    df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis=1)
    x1 = df2[df2['count'] >= 3]
    concat_all.append(x1)
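This works, but it is quite slow. Any tips on speeding it up?
For example, running the code on the two DataFrames below, the first row of df1 returns the first three rows of df2 but not the remaining ones.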
df1
FirstName | LastName | Birthday | ResidenceZip
John      | Doe      | 1/1/2000 | 99999
Rob       | A        | 1/1/2010 | 19499
df2
FirstName | LastName | Birthday | ResidenceZip | count
John      | Doe      | 1/1/2000 | 99999        | 3
John      | Doe      | 1/1/2000 | 99999        | 3
John      | Doex     | 1/1/2000 | 99999        | 3
Joha      | Doex     | 1/1/2000 | 99999        | 2
Joha      | Doex     | 9/9/2000 | 99999        | 1
Rob       | A        | 9/9/2009 | 19499        | 0
Answer 0 (score: 1)
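I'm not sure there is a way to avoid iterating over at least one of the DataFrames, but here is an option that can speed things up. It does allow accidental matches between FirstName and LastName, although that can be avoided by adding a unique prefix to the values (for example, "@" for first names and "&" for last names).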
import numpy as np
s1 = [set(x) for x in df1.values]
s2 = [set(x) for x in df2.values]
masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
concat_all = [df2[m] for m in masks]
concat_all
[  FirstName LastName  Birthday  ResidenceZip
 0      John      Doe  1/1/2000         99999
 1      John      Doe  1/1/2000         99999
 2      John     Doex  1/1/2000         99999,
   FirstName LastName  Birthday  ResidenceZip
 5       Rob        A  9/9/2009         19499]
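The prefix idea mentioned above could be sketched roughly as follows. Instead of string prefixes such as "@" or "&", this version pairs each value with its column name, which has the same effect: a first name can no longer match a last name. The helper names `tagged_row_sets` and `partial_match_tagged` and the `COLS` list are made up for this illustration, and the sketch assumes both DataFrames contain the same four columns.

```python
import numpy as np
import pandas as pd

# Columns to compare; assumed to exist in both DataFrames.
COLS = ["FirstName", "LastName", "Birthday", "ResidenceZip"]

def tagged_row_sets(df, cols=COLS):
    """Turn each row into a set of (column, value) pairs, so a value can
    only ever match a value that came from the same column."""
    return [set(zip(cols, row)) for row in df[cols].to_numpy()]

def partial_match_tagged(df1, df2, cols=COLS, min_matches=3):
    """For each row of df1, return the rows of df2 that agree with it
    on at least `min_matches` of the columns in `cols`."""
    s1 = tagged_row_sets(df1, cols)
    s2 = tagged_row_sets(df2, cols)
    # One boolean mask over df2 per row of df1.
    masks = np.array(
        [[len(x & y) >= min_matches for y in s2] for x in s1],
        dtype=bool,
    ).reshape(len(df1), len(df2))
    return [df2[m] for m in masks]
```

With this variant, a df1 row whose first and last names are swapped relative to df2 only matches on Birthday and ResidenceZip, so it no longer reaches the threshold of 3.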
def Alollz(df1, df2):
    s1 = [set(x) for x in df1.values]
    s2 = [set(x) for x in df2.values]
    masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
    concat_all = [df2[m] for m in masks]
    return concat_all
def SharpObject(df1, df2):
    concat_all = []
    for i, row in df1.iterrows():
        c = {'ResidenceZip': row['ResidenceZip'], 'FirstName': row['FirstName'],
             'LastName': row['LastName'], 'Birthday': row['Birthday']}
        df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis=1)
        x1 = df2[df2['count'] >= 3]
        concat_all.append(x1)
    return concat_all
%timeit Alollz(df1, df2)
#785 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit SharpObject(df1, df2)
#3.56 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On larger data:
# you should never grow dfs like this in a loop
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
for i in range(7):
    df1 = pd.concat([df1, df1])
    df2 = pd.concat([df2, df2])
%timeit Alollz(df1, df2)
#132 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit SharpObject(df1, df2)
#6.88 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answer 1 (score: 1)
Use NumPy's isin function:
df1_vals = df1.values
df2_vals = df2.values
df1_rows = range(df1_vals.shape[0])
concat_all = \
[df2[np.add.reduce(np.isin(df2_vals, df1_vals[row]), axis=1) >= 3] for row in df1_rows]
Here are the DataFrames used for setup:
import pandas as pd
df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})
df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000', '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})
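Note that np.isin, like a set intersection, does not care which column a value came from. If column-by-column comparison is wanted (as in the original partialMatch), one possible variant is to compare the two value arrays with NumPy broadcasting. This is a different technique from the isin one-liner, offered only as a sketch: the function name `partial_match_by_column` and its parameters are made up here, and it assumes both DataFrames share the columns listed in `cols`.

```python
import numpy as np
import pandas as pd

def partial_match_by_column(df1, df2, cols, min_matches=3):
    """For each row of df1, return the rows of df2 that have the same
    value in at least `min_matches` of the columns in `cols`."""
    a = df1[cols].to_numpy(dtype=object)    # shape (n1, k)
    b = df2[cols].to_numpy(dtype=object)    # shape (n2, k)
    # Broadcasting compares every df1 row with every df2 row,
    # one column at a time: the result has shape (n1, n2, k).
    equal = a[:, None, :] == b[None, :, :]
    # Count matching columns per (df1 row, df2 row) pair.
    masks = equal.sum(axis=2) >= min_matches
    return [df2[m] for m in masks]
```

With the setup DataFrames above, `partial_match_by_column(df1, df2, ['FirstName', 'LastName', 'Birthday', 'ResidenceZip'])` should return df2 rows 0, 1, 2 for the first row of df1 and row 5 for the second. The intermediate array holds n1 × n2 × k booleans, so for very large inputs it may be necessary to process df1 in chunks. Missing values (NaN) never compare equal, so they do not count as matches.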