Python Pandas fuzzywuzzy 'join' on string columns

Asked: 2015-11-20 14:13:41

Tags: python pandas fuzzywuzzy

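I am following this question on using fuzzywuzzy to 'join' two datasets on string columns.

I am getting an error and I am having trouble troubleshooting it.

  • The error message seems to point to a key problem. Assuming it was caused by null values, I filtered them out, but I still get the same error message.

  • The strings are company names that may contain apostrophes, hyphens, periods, and so on. I am assuming fuzzywuzzy can handle those without my stripping them first.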
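
If punctuation does turn out to affect the scores, a normalization step before comparing is one option. The following is only a sketch under assumptions: it targets Python 3, `normalize_name` and `similarity` are hypothetical helpers that do not appear in the question, and `similarity` uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so the exact scores may differ from fuzzywuzzy's.

```python
import string
from difflib import SequenceMatcher

# Translation table that deletes ASCII punctuation (apostrophes, hyphens, periods, ...).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_name(name):
    """Hypothetical helper: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(name.lower().translate(_PUNCT_TABLE).split())


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (an assumption, not fuzzywuzzy itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Made-up company names, scored with and without normalization.
raw_score = similarity("O'Brien & Sons, Inc.", "OBrien and Sons Inc")
clean_score = similarity(normalize_name("O'Brien & Sons, Inc."),
                         normalize_name("OBrien and Sons Inc"))
print(raw_score, clean_score)
```

Whether this step is worth adding depends on the data. Comparing the two scores on a handful of real names is a quick way to find out.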

Any insight into what I should look for as next steps in troubleshooting this?

Here is the data import from the Excel files using Pandas:

import pandas as pd
from fuzzywuzzy import fuzz
import difflib 

vendor_file = "vendor.xlsx"
spr_file = "spr.xlsx"

xl_vendor = pd.ExcelFile(vendor_file)
xl_spr = pd.ExcelFile(spr_file)

vendor1 = xl_vendor.parse(xl_vendor.sheet_names[0])
spr1 = xl_spr.parse(xl_spr.sheet_names[0])

spr = spr1[pd.notnull(spr1['Contractor'])]
vendor = vendor1[pd.notnull(vendor1['Vendor Name'])]

Here is the part, taken from the other question, that matches and joins the datasets:

def get_spr(row):
    d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*2
    else:
        v = spr.ix[d.idxmax(), ['Vendor Name', 'Pass/Fail']].values
    return pd.Series(v, index=['Vendor Name', 'Pass/Fail'])

# Must be unindented from function indent
pd.concat((vendor, vendor.apply(get_spr, axis=1)), axis=1)

The error is:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-41973cb5c3d7> in <module>()
----> 1 pd.concat((vendor, vendor.apply(get_spr, axis=1)), axis=1)

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   3716                     if reduce is None:
   3717                         reduce = True
-> 3718                     return self._apply_standard(f, axis, reduce=reduce)
   3719             else:
   3720                 return self._apply_broadcast(f, axis)

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   3806             try:
   3807                 for i, v in enumerate(series_gen):
-> 3808                     results[i] = func(v)
   3809                     keys.append(v.name)
   3810             except Exception as e:

<ipython-input-19-62cc0c6c6daf> in get_spr(row)
      1 def get_spr(row):
----> 2     d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
      3     d = d[d >= 75]
      4     if len(d) == 0:
      5         v = ['']*2

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   3716                     if reduce is None:
   3717                         reduce = True
-> 3718                     return self._apply_standard(f, axis, reduce=reduce)
   3719             else:
   3720                 return self._apply_broadcast(f, axis)

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   3806             try:
   3807                 for i, v in enumerate(series_gen):
-> 3808                     results[i] = func(v)
   3809                     keys.append(v.name)
   3810             except Exception as e:

<ipython-input-19-62cc0c6c6daf> in <lambda>(x)
      1 def get_spr(row):
----> 2     d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
      3     d = d[d >= 75]
      4     if len(d) == 0:
      5         v = ['']*2

C:\Anaconda\lib\site-packages\pandas\core\series.pyc in __getitem__(self, key)
    519     def __getitem__(self, key):
    520         try:
--> 521             result = self.index.get_value(self, key)
    522 
    523             if not np.isscalar(result):

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key)
   1607                     raise InvalidIndexError(key)
   1608                 else:
-> 1609                     raise e1
   1610             except Exception:  # pragma: no cover
   1611                 raise e1

KeyError: ('Contractor', u'occurred at index 3', u'occurred at index 0')

Edited to add the dataframe columns:

spr: 'Contractor', 'Pass/Fail'
vendor: 'Vendor Name'

Edited to add the corrected matching revision, based on davidshinn's answer:

def get_spr(row):
    d = spr.apply(lambda x: fuzz.ratio(x['Contractor'], row['Vendor Name']) * 2 if row['Vendor Name'] == x['Contractor'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*2
    else:
        v = spr.ix[d.idxmax(), ['Contractor', 'Pass/Fail']].values
    return pd.Series(v, index=['Contractor', 'Pass/Fail'])
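
Two things about the revision above may be worth checking. First, the conditional `if row['Vendor Name'] == x['Contractor'] else 1` gives every pair that is not an exact match a score of 1, which the `>= 75` filter then discards, so only identical names can match. Second, `.ix` has been removed from recent pandas releases, where `.loc` is the label-based replacement. The following is a minimal sketch of the same idea without the equality condition. It assumes Python 3 and a recent pandas, and it uses made-up sample data. `similarity` and `best_match` are hypothetical names, and `similarity` relies on the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so scores may not match fuzzywuzzy's exactly.

```python
import pandas as pd
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for fuzz.ratio (an assumption): 0-100 score from the standard library."""
    return round(100 * SequenceMatcher(None, str(a), str(b)).ratio())


def best_match(row, candidates, threshold=75):
    """Return the best-scoring 'Contractor' row for one vendor row, or blanks."""
    # Score every contractor name against this vendor name.
    scores = candidates['Contractor'].apply(
        lambda name: similarity(name, row['Vendor Name']))
    scores = scores[scores >= threshold]
    if scores.empty:
        values = ['', '']
    else:
        # .loc replaces the removed .ix for label-based lookup.
        values = candidates.loc[scores.idxmax(), ['Contractor', 'Pass/Fail']].values
    return pd.Series(values, index=['Contractor', 'Pass/Fail'])


# Made-up sample data using the column names given in the question.
vendor = pd.DataFrame({'Vendor Name': ['Acme Corp.', 'Globex Inc', 'Initech']})
spr = pd.DataFrame({'Contractor': ['Acme Corp', 'Globex, Inc.', 'Umbrella LLC'],
                    'Pass/Fail': ['Pass', 'Fail', 'Pass']})

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
matched = vendor.apply(best_match, axis=1, candidates=spr)
joined = pd.concat((vendor, matched), axis=1)
print(joined)
```

Passing `spr` in as the `candidates` argument, instead of reading it from the enclosing scope, makes it explicit which frame supplies `row` and which supplies the candidates. Mixing those two up is what the KeyError in the question points to.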

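1 Answer:

Answer 0: (score: 0)

Can you provide the column names of the vendor and spr dataframes? Are you sure Contractor is a valid column in the vendor dataframe, since that is the dataframe row['Contractor'] is trying to access?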
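
One way to check this is to print each frame's columns and look up the label on a single row. This is a sketch with made-up one-row frames that only mirror the column layout reported in the question, where `vendor` has 'Vendor Name' and `spr` has 'Contractor' and 'Pass/Fail'.

```python
import pandas as pd

# Made-up frames that mirror the column layout reported in the question.
spr = pd.DataFrame({'Contractor': ['Acme Corp'], 'Pass/Fail': ['Pass']})
vendor = pd.DataFrame({'Vendor Name': ['Acme Corp.']})

print(list(vendor.columns))
print(list(spr.columns))

# vendor.apply(..., axis=1) hands each vendor row to the function as a Series
# whose index holds the vendor column labels.
row = vendor.iloc[0]
has_contractor = 'Contractor' in row.index

try:
    row['Contractor']
    lookup_error = None
except KeyError as exc:
    # Same kind of failure as in the traceback: the label is not in the row.
    lookup_error = exc

print(has_contractor, repr(lookup_error))
```

With this layout, `row` comes from `vendor`, so it only carries 'Vendor Name'. Looking up 'Contractor' on it raises KeyError whether or not null values were filtered out.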