Matching multiple strings in a row of one DataFrame against rows in another DataFrame

Time: 2017-04-22 21:25:03

Tags: python algorithm pandas matching

I have two DataFrames, A and B. A is the smaller one, with 500 rows; B is larger, with 20000 rows. The columns of A are:

A.columns = ['name','company','model','family']

and the columns of B are:

B.columns = ["title", "price"]

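The title column in B is one big messy string, but it does contain the strings from three of A's columns: company, model and family (ignore the 'name' column, since the name in A is itself a combination of company, model and family). I need to match each row in A with one row in B. Here is my solution: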

out = pd.DataFrame(columns=['name', 'company', 'model', 'family', 'title', 'price'])

for index, row in A.iterrows():
    lst=[A.loc[index,'family'], A.loc[index,'model'], A.loc[index,'company']]
    for i, r in B.iterrows():
        if all(w in B.loc[i,'title'] for w in lst):        
            out.loc[index,'name']=A.loc[index,'name']
            out.loc[index,'company']=A.loc[index,'company']
            out.loc[index,'model']=A.loc[index,'model']
            out.loc[index,'family']=A.loc[index,'family']

            out.loc[index,'title']=B.loc[i,'title']
            out.loc[index,'price']=B.loc[i,'price']
            break

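This works, but it is very inefficient and takes a long time. I know this is a "record linkage" problem and that people are actively researching its accuracy and speed, but is there a faster, more efficient way to do this in pandas? It would be faster if I checked only one or two of the items against the title, but I'm worried that would reduce accuracy...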

In terms of accuracy, I would rather get fewer matches than wrong matches.

Also, A.dtypes and B.dtypes show that the columns of both DataFrames are of type object:

title           object
price           object
dtype: object

I'd appreciate any comments. Thanks.

********* ***********

UPDATE

A portion of the two files can be found here: A B

I did some cleaning:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.colors as mcol
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import math

A = pd.read_csv('A.txt', delimiter=',', header=None) 
A.columns = ['product name','manufacturer','model','family','announced date']

for index, row in A.iterrows():    
    A.loc[index, "product name"] = A.loc[index, "product name"].split('"')[3]
    A.loc[index, "manufacturer"] = A.loc[index, "manufacturer"].split('"')[1]
    A.loc[index, "model"] = A.loc[index, "model"].split('"')[1]
    if 'family' in A.loc[index, "family"]:
        A.loc[index, "family"] = A.loc[index, "family"].split('"')[1]
    if 'announced' in A.loc[index, "family"]:
        A.loc[index, "announced date"] = A.loc[index, "family"]
        A.loc[index, "family"] = ''
    A.loc[index, "announced date"] = A.loc[index, "announced date"].split('"')[1]

A = A.reset_index(drop=True)  # reset_index returns a new frame, so assign it back

# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, warn_bad_lines=False
B = pd.read_csv('B.txt', on_bad_lines='skip', header=None)

B.columns = ["title", "manufacturer", "currency", "price"]
pd.options.display.max_colwidth=200

for index, row in B.iterrows():
    B.loc[index,'manufacturer']=B.loc[index,'manufacturer'].split('"')[1]
    B.loc[index,'currency']=B.loc[index,'currency'].split('"')[1]
    B.loc[index,'price']=B.loc[index,'price'].split('"')[1]
    B.loc[index,'title']=B.loc[index,'title'].split('"')[3]

Then I applied Andrew's approach, as shown in the answer:

def match_strs(row):
    return np.where(B.title.str.contains(row['manufacturer']) & \
                    B.title.str.contains(row['family']) & \
                    B.title.str.contains(row['model']))[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')

(A.merge(B, left_on='merge_idx', right_index=True, how='right')
  .drop(columns='merge_idx')
  .dropna())

As I said, I ran into some complications that I can't figure out. Any help is much appreciated.

1 Answer:

Answer 0: (score: 1)

Here is some sample data to work with:

import numpy as np
import pandas as pd

# make A df
manufacturer = ['A','B','C']
model = ['foo','bar','baz']
family = ['X','Y','Z']
name = ['{}_{}_{}'.format(manufacturer[i],model[i],family[i]) for i in range(len(manufacturer))]
A = pd.DataFrame({'name':name,'manufacturer': manufacturer,'model':model,'family':family})

# A
  manufacturer family model     name
0            A      X   foo  A_foo_X
1            B      Y   bar  B_bar_Y
2            C      Z   baz  C_baz_Z

# make B df
title = ['blahblahblah']
title.extend( ['{}_{}'.format(n, 'blahblahblah') for n in name] )
B = pd.DataFrame({'title':title,'price':np.random.randint(1,100,4)})

# B
   price                 title
0     62          blahblahblah
1      7  A_foo_X_blahblahblah
2     92  B_bar_Y_blahblahblah
3     24  C_baz_Z_blahblahblah

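We can write a function that, based on your matching criteria, finds for each row of A the index of the first matching row in B, and store those indices in a new column: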

def match_strs(row):
    match_result = (np.where(B.title.str.contains(row['manufacturer']) & \
                             B.title.str.contains(row['family']) & \
                             B.title.str.contains(row['model'])))
    if not len(match_result[0]):
        return None
    return match_result[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')
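
One caveat worth noting: `str.contains` treats its pattern as a regular expression by default, so a model name containing characters such as `(` or `+` may fail or match the wrong thing, and an empty `family` string matches every title. A hedged variant, assuming plain substring matching is what is wanted, passes `regex=False` and skips empty fields. The helper name `match_literal` and the demo frames are illustrative only and are not part of the original post.

```python
import numpy as np
import pandas as pd

def match_literal(row, titles, fields=("manufacturer", "family", "model")):
    """Position of the first title containing every non-empty field, else NaN."""
    mask = pd.Series(True, index=titles.index)
    used_any = False
    for field in fields:
        value = row[field]
        if isinstance(value, str) and value:  # skip empty or missing fields
            mask &= titles.str.contains(value, regex=False, na=False)
            used_any = True
    if not used_any:
        return np.nan  # nothing to match on, so do not claim a match
    hits = np.flatnonzero(mask.to_numpy())
    return hits[0] if len(hits) else np.nan

# Made-up data: one empty family, one model with regex metacharacters, one unmatched row
A_demo = pd.DataFrame({
    "manufacturer": ["A", "B", "Q"],
    "family": ["X", "", "Z"],
    "model": ["foo", "bar(1)", "baz"],
})
B_demo = pd.DataFrame({"title": ["blah", "A_foo_X_blah", "B_bar(1)_blah"]})

A_demo["merge_idx"] = A_demo.apply(match_literal, axis="columns", titles=B_demo["title"])
```

Rows without any match get NaN in `merge_idx`, which errs on the side of fewer matches rather than wrong ones.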

Then merge A and B:

(A.merge(B, left_on='merge_idx', right_index=True, how='right')
  .drop(columns='merge_idx')
  .dropna())

Output:

  manufacturer family model     name  price                 title
0            A      X   foo  A_foo_X     23  A_foo_X_blahblahblah
1            B      Y   bar  B_bar_Y     14  B_bar_Y_blahblahblah
2            C      Z   baz  C_baz_Z     19  C_baz_Z_blahblahblah

Is this what you were looking for?

Note that if you want to keep the rows of B even when there is no match in A, just remove the .dropna() at the end of the merge.
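
To make that note concrete, here is a small sketch on made-up data showing what the right join returns with and without the final filtering step. The frame names and values are assumptions for illustration. Dropping unmatched rows of A and casting the key to int before merging is a precaution added here, not something the original code does.

```python
import numpy as np
import pandas as pd

A_demo = pd.DataFrame({"name": ["A_foo_X", "Q_zip_W"], "merge_idx": [1.0, np.nan]})
B_demo = pd.DataFrame({"title": ["blahblahblah", "A_foo_X_blahblahblah"],
                       "price": [62, 7]})

# Rows of A without a match cannot join anything; drop them and use an integer key
A_matched = A_demo.dropna(subset=["merge_idx"]).astype({"merge_idx": int})

# Right join: every row of B is kept; A's columns are NaN where nothing matched
all_b = (A_matched.merge(B_demo, left_on="merge_idx", right_index=True, how="right")
                  .drop(columns="merge_idx"))

# Filtering on an A column keeps only the matched pairs
only_matched = all_b.dropna(subset=["name"])
```

Here `all_b` has one row per row of B, while `only_matched` keeps just the pairs where a row of A was found.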