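I am following the approach from another question that uses fuzzywuzzy to join two datasets on a string column.
I am getting an error and am having trouble troubleshooting it.
The error message seems to point to a key problem. Assuming it was caused by null values, I filtered them out, but I still get the same error message.
The strings are company names that may contain apostrophes, hyphens, periods, and so on. I am assuming fuzzywuzzy can handle those without my removing them first.
Any insight into what I should look for as a next step in resolving this?
This imports the data from the Excel files using pandas: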
import pandas as pd
from fuzzywuzzy import fuzz
import difflib
vendor_file = "vendor.xlsx"
spr_file = "spr.xlsx"
xl_vendor = pd.ExcelFile(vendor_file)
xl_spr = pd.ExcelFile(spr_file)
vendor1 = xl_vendor.parse(xl_vendor.sheet_names[0])
spr1 = xl_spr.parse(xl_spr.sheet_names[0])
spr = spr1[pd.notnull(spr1['Contractor'])]
vendor = vendor1[pd.notnull(vendor1['Vendor Name'])]
This is the part, taken from the other question, that matches and joins the datasets:
def get_spr(row):
    d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*2
    else:
        v = spr.ix[d.idxmax(), ['Vendor Name', 'Pass/Fail']].values
    return pd.Series(v, index=['Vendor Name', 'Pass/Fail'])
# Must be unindented from function indent
pd.concat((vendor, vendor.apply(get_spr, axis=1)), axis=1)
The error is:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-21-41973cb5c3d7> in <module>()
----> 1 pd.concat((vendor, vendor.apply(get_spr, axis=1)), axis=1)
C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
3716 if reduce is None:
3717 reduce = True
-> 3718 return self._apply_standard(f, axis, reduce=reduce)
3719 else:
3720 return self._apply_broadcast(f, axis)
C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
3806 try:
3807 for i, v in enumerate(series_gen):
-> 3808 results[i] = func(v)
3809 keys.append(v.name)
3810 except Exception as e:
<ipython-input-19-62cc0c6c6daf> in get_spr(row)
1 def get_spr(row):
----> 2 d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
3 d = d[d >= 75]
4 if len(d) == 0:
5 v = ['']*2
C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
3716 if reduce is None:
3717 reduce = True
-> 3718 return self._apply_standard(f, axis, reduce=reduce)
3719 else:
3720 return self._apply_broadcast(f, axis)
C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
3806 try:
3807 for i, v in enumerate(series_gen):
-> 3808 results[i] = func(v)
3809 keys.append(v.name)
3810 except Exception as e:
<ipython-input-19-62cc0c6c6daf> in <lambda>(x)
1 def get_spr(row):
----> 2 d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
3 d = d[d >= 75]
4 if len(d) == 0:
5 v = ['']*2
C:\Anaconda\lib\site-packages\pandas\core\series.pyc in __getitem__(self, key)
519 def __getitem__(self, key):
520 try:
--> 521 result = self.index.get_value(self, key)
522
523 if not np.isscalar(result):
C:\Anaconda\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key)
1607 raise InvalidIndexError(key)
1608 else:
-> 1609 raise e1
1610 except Exception: # pragma: no cover
1611 raise e1
KeyError: ('Contractor', u'occurred at index 3', u'occurred at index 0')
Edited to add the dataframe columns:
spr: 'Contractor', 'Pass/Fail'
vendor: 'Vendor Name'
Edited to add the corrected matching code, revised based on davidshinn's answer:
def get_spr(row):
    d = spr.apply(lambda x: fuzz.ratio(x['Contractor'], row['Vendor Name']) * 2 if row['Vendor Name'] == x['Contractor'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*2
    else:
        v = spr.ix[d.idxmax(), ['Contractor', 'Pass/Fail']].values
    return pd.Series(v, index=['Contractor', 'Pass/Fail'])
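For readers on newer pandas releases, where `.ix` is no longer available, the same lookup can be sketched with `.loc`. This is only an illustration under stated assumptions: the sample frames are made up (only the column names come from the question), `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio` so the sketch runs without fuzzywuzzy, and the helper name `similarity` and the `threshold` parameter are hypothetical. It also scores every pair of names, because the conditional in the revision above gives every non-identical pair a score of 1, so only exact matches would pass the threshold of 75.

```python
import difflib

import pandas as pd

# Made-up sample data; only the column names come from the question.
spr = pd.DataFrame({"Contractor": ["Acme Corp.", "O'Neil-Smith Ltd"],
                    "Pass/Fail": ["Pass", "Fail"]})
vendor = pd.DataFrame({"Vendor Name": ["Acme Corp", "Unrelated Widgets"]})


def similarity(a, b):
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio)."""
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()


def get_spr(row, threshold=75):
    # Score this vendor name against every contractor name in spr.
    scores = spr["Contractor"].apply(
        lambda name: similarity(name, row["Vendor Name"]))
    scores = scores[scores >= threshold]
    if scores.empty:
        values = ["", ""]
    else:
        # .loc does the label-based lookup that .ix did in the original.
        values = spr.loc[scores.idxmax(), ["Contractor", "Pass/Fail"]].values
    return pd.Series(values, index=["Contractor", "Pass/Fail"])


# Each row of vendor is passed to get_spr, so row['Vendor Name'] is valid.
matched = pd.concat((vendor, vendor.apply(get_spr, axis=1)), axis=1)
print(matched)
```

Vendors with no contractor above the threshold get empty strings in both added columns, mirroring the `['']*2` fallback in the code above.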
Answer 0 (score: 0)
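Could you provide the column names of the vendor and spr dataframes? Are you sure Contractor is a valid column in the vendor dataframe? That is the dataframe that row['Contractor'] is trying to access.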
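A minimal sketch of that check, assuming pandas is installed. The two small frames are made-up stand-ins for the Excel data (only the column names are taken from the question), and `lookup_is_valid` is a hypothetical helper added for illustration.

```python
import pandas as pd

# Made-up stand-ins for the real data; only the column names are real.
spr = pd.DataFrame({"Contractor": ["Acme Corp.", "O'Neil-Smith Ltd"],
                    "Pass/Fail": ["Pass", "Fail"]})
vendor = pd.DataFrame({"Vendor Name": ["Acme Corp", "Widget Co."]})

# The labels that a row of each frame can be indexed by.
print(list(vendor.columns))
print(list(spr.columns))


def lookup_is_valid(frame, column):
    """Return True if row[column] would succeed for rows of `frame`."""
    return column in frame.columns


# vendor.apply(func, axis=1) hands func one row of vendor at a time,
# so inside func only vendor's column labels are valid keys on `row`.
first_row = vendor.iloc[0]
try:
    first_row["Contractor"]
    raised = False
except KeyError:
    raised = True

print("row['Contractor'] raised KeyError:", raised)
```

If the lookup fails this way, the column names used on `row` and on `x` inside `get_spr` are likely swapped relative to the frames they come from.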