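I have two DataFrames, A and B. A is the smaller one, with 500 rows; B is larger, with 20,000 rows. The columns of A are:
A.columns = ['name','company','model','family']
and the columns of B are:
B.columns = ["title", "price"]
The title column in B is one long, messy string, but it does contain the strings from three of A's columns: company, model and family (ignore the 'name' column, since the name in A is itself a combination of company, model and family). I need to match each row in A to one row in B. Here is my solution: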
out = pd.DataFrame(columns=["name", 'company', 'model', 'family', 'title', 'price'])
for index, row in A.iterrows():
    lst = [A.loc[index,'family'], A.loc[index,'model'], A.loc[index,'company']]
    for i, r in B.iterrows():
        # keep the first title that contains all three strings
        if all(w in B.loc[i,'title'] for w in lst):
            out.loc[index,'name'] = A.loc[index,'name']
            out.loc[index,'company'] = A.loc[index,'company']
            out.loc[index,'model'] = A.loc[index,'model']
            out.loc[index,'family'] = A.loc[index,'family']
            out.loc[index,'title'] = B.loc[i,'title']
            out.loc[index,'price'] = B.loc[i,'price']
            break
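This does the job, but it is very inefficient and takes a long time. I know this is a "record linkage" problem and that people research its accuracy and speed, but is there a faster, more efficient way to do it in pandas? If I checked only one or two of the items in the title it would be faster, but I worry that would reduce accuracy...
In terms of accuracy, I would rather get fewer matches than wrong matches.
Also, A.dtypes and B.dtypes show that the columns of both DataFrames are of type object: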
title object
price object
dtype: object
I appreciate any comments. Thanks.
********* ***********
UPDATE: I did some cleaning:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.colors as mcol
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import math
A = pd.read_csv('A.txt', delimiter=',', header=None)
A.columns = ['product name','manufacturer','model','family','announced date']
for index, row in A.iterrows():
    A.loc[index, "product name"] = A.loc[index, "product name"].split('"')[3]
    A.loc[index, "manufacturer"] = A.loc[index, "manufacturer"].split('"')[1]
    A.loc[index, "model"] = A.loc[index, "model"].split('"')[1]
    if 'family' in A.loc[index, "family"]:
        A.loc[index, "family"] = A.loc[index, "family"].split('"')[1]
    if 'announced' in A.loc[index, "family"]:
        A.loc[index, "announced date"] = A.loc[index, "family"]
        A.loc[index, "family"] = ''
    A.loc[index, "announced date"] = A.loc[index, "announced date"].split('"')[1]
A.columns=['product name','manufacturer','model','family','announced date']
A.reset_index()
# error_bad_lines / warn_bad_lines were removed in pandas 2.0; use on_bad_lines='skip' there
B = pd.read_csv('B.txt', error_bad_lines=False, warn_bad_lines=False, header=None)
B.columns = ["title", "manufacturer", "currency", "price"]
pd.options.display.max_colwidth=200
for index, row in B.iterrows():
    B.loc[index,'manufacturer']=B.loc[index,'manufacturer'].split('"')[1]
    B.loc[index,'currency']=B.loc[index,'currency'].split('"')[1]
    B.loc[index,'price']=B.loc[index,'price'].split('"')[1]
    B.loc[index,'title']=B.loc[index,'title'].split('"')[3]
Then I applied Andrew's method as shown in the answer:
def match_strs(row):
    return np.where(B.title.str.contains(row['manufacturer']) & \
                    B.title.str.contains(row['family']) & \
                    B.title.str.contains(row['model']))[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')
(A.merge(B, left_on='merge_idx', right_on='index', right_index=True, how='right')
  .drop('merge_idx', 1)
  .dropna())
As I said, some complications came up that I cannot figure out. Thank you very much for your help.
Answer 0 (score: 1)
Here is some sample data to work with:
import numpy as np
import pandas as pd
# make A df
manufacturer = ['A','B','C']
model = ['foo','bar','baz']
family = ['X','Y','Z']
name = ['{}_{}_{}'.format(manufacturer[i],model[i],family[i]) for i in range(len(manufacturer))]
A = pd.DataFrame({'name':name,'manufacturer': manufacturer,'model':model,'family':family})
# A
  manufacturer family model     name
0            A      X   foo  A_foo_X
1            B      Y   bar  B_bar_Y
2            C      Z   baz  C_baz_Z
# make B df
title = ['blahblahblah']
title.extend( ['{}_{}'.format(n, 'blahblahblah') for n in name] )
B = pd.DataFrame({'title':title,'price':np.random.randint(1,100,4)})
# B
   price                 title
0     62          blahblahblah
1      7  A_foo_X_blahblahblah
2     92  B_bar_Y_blahblahblah
3     24  C_baz_Z_blahblahblah
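We can create a function that, based on your matching criteria, finds the index of the row in B that matches each row in A, and store those indices in a new column: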
def match_strs(row):
    # positions in B whose title contains all three fields of this row
    match_result = np.where(B.title.str.contains(row['manufacturer']) &
                            B.title.str.contains(row['family']) &
                            B.title.str.contains(row['model']))
    # no matching title: return None so the row can be dropped after the merge
    if not len(match_result[0]):
        return None
    return match_result[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')
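One caveat with the function above: `str.contains` treats its argument as a regular expression by default, so a model string containing characters such as `(` or `+` may fail to match or match the wrong text. The following is a minimal sketch of a literal-matching variant, not a definitive implementation. The sample data, the helper name `match_literal`, and the choice to compare case-insensitively are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
A = pd.DataFrame({
    "manufacturer": ["Sony", "Canon"],
    "family": ["Cyber-shot", "PowerShot"],
    "model": ["DSC-W310", "A3000 (IS)"],
})
B = pd.DataFrame({
    "title": [
        "unrelated listing",
        "SONY cyber-shot dsc-w310 digital camera",
        "Canon PowerShot A3000 (IS) 10 MP",
    ],
    "price": ["10.00", "120.00", "99.99"],
})

# Lowercase the titles once instead of on every lookup
# (case-insensitive comparison is an assumption, not part of the original answer).
titles = B["title"].str.lower()

def match_literal(row):
    """Return the position of the first title containing all three fields, else None."""
    mask = pd.Series(True, index=titles.index)
    for col in ("manufacturer", "family", "model"):
        # regex=False treats the field as plain text, so '(' or '+' are safe;
        # na=False counts missing titles as non-matches.
        mask &= titles.str.contains(str(row[col]).lower(), regex=False, na=False)
    hits = mask.to_numpy().nonzero()[0]
    return int(hits[0]) if len(hits) else None

A["merge_idx"] = A.apply(match_literal, axis="columns")
```

With the default `regex=True`, the pattern `A3000 (IS)` would be read as a capture group and would not match the literal parentheses in the title, which is why the sketch switches to plain-text matching.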
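Then merge A and B:
# right_index=True already joins on B's index, so right_on is not needed;
# drop(columns=...) replaces the positional axis argument removed in pandas 2.0
(A.merge(B, left_on='merge_idx', right_index=True, how='right')
  .drop(columns='merge_idx')
  .dropna())
Output: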
  manufacturer family model     name  price                 title
0            A      X   foo  A_foo_X     23  A_foo_X_blahblahblah
1            B      Y   bar  B_bar_Y     14  B_bar_Y_blahblahblah
2            C      Z   baz  C_baz_Z     19  C_baz_Z_blahblahblah
Is this what you are looking for?
Note that if you want to keep the rows of B even when they have no match in A, just remove the .dropna() at the end of the merge.
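Since the question states a preference for fewer matches over wrong ones, one possible variation is to accept a match only when exactly one title in B contains all three fields, and to discard ambiguous rows. This is a sketch under that assumption, not the method from the answer above. The helper name `unique_match` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
A = pd.DataFrame({
    "name": ["A_foo_X", "B_bar_Y", "C_baz_Z"],
    "manufacturer": ["A", "B", "C"],
    "model": ["foo", "bar", "baz"],
    "family": ["X", "Y", "Z"],
})
B = pd.DataFrame({
    "title": ["blahblahblah", "A_foo_X_blah", "B_bar_Y_blah", "B_bar_Y_other"],
    "price": [62, 7, 92, 24],
})

def unique_match(row):
    """Return B's index label only when exactly one title contains all three fields."""
    mask = pd.Series(True, index=B.index)
    for col in ("manufacturer", "family", "model"):
        mask &= B["title"].str.contains(row[col], regex=False, na=False)
    hits = B.index[mask]
    # Zero hits or several hits are both treated as "no reliable match".
    return hits[0] if len(hits) == 1 else None

A["merge_idx"] = A.apply(unique_match, axis="columns")

matched = (A.dropna(subset=["merge_idx"])          # keep only unambiguous rows
            .astype({"merge_idx": "int64"})        # align dtype with B's integer index
            .merge(B, left_on="merge_idx", right_index=True, how="inner")
            .drop(columns="merge_idx"))
```

In this sample only `A_foo_X` survives: `B_bar_Y` appears in two titles and is dropped as ambiguous, and `C_baz_Z` has no matching title at all.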