Python Dedupe package error: "Records do not line up with data model", but everything looks fine

Time: 2019-07-01 23:26:35

Tags: python python-dedupe

I am following various online tutorials for Python dedupe, but no matter what I try I keep running into this error:

ValueError: Records do not line up with data model. The field 'firstname ' is in data_model but not in a record

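Someone on their GitHub had the same problem: https://github.com/dedupeio/csvdedupe/issues/55, and the developer said that the training examples must contain whatever field is named in this error message.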

My data has a firstname key in its records, and so does the fields variable.

The data to be deduplicated:


{76550: {'id': '76550',
  'title': 'mrs',
  'firstname': 'mary',
  'lastname': 'fakename',
  'email': 'fakemail@yahoo.com',
  'phone': None,
  'mobile': '353870748',
   etc etc etc}

These are the fields:


fields = [
        {'field' : 'firstname ', 'type': 'String','has missing' : True},
        {'field' : 'lastname ', 'type': 'String','has missing' : True},
        {'field' : 'email', 'type': 'String','has missing' : True},
        {'field' : 'address1', 'type': 'String', 'has missing' : True},
        {'field' : 'mobile', 'type': 'String', 'has missing' : True},
        ]

The error is raised here:


# Pass in our model
deduper = dedupe.Dedupe(fields)

# Feed some sample data in ... 1500 records
deduper.sample(df, 1500)

ValueError                                Traceback (most recent call last)
<ipython-input-89-e34caa52a74c> in <module>
      2 
      3 # Feed some sample data in ... 15000 records
----> 4 deduper.sample(df, 1500)

~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\api.py in sample(self, data, sample_size, blocked_proportion, original_length)
    789                                a sample of full data
    790         '''
--> 791         self._checkData(data)
    792 
    793         self.active_learner = self.ActiveLearner(self.data_model,

~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\api.py in _checkData(self, data)
    802                 'Dictionary of records is empty.')
    803 
--> 804         self.data_model.check(next(iter(viewvalues(data))))
    805 
    806 

~\Anaconda3\envs\Tensorflow\lib\site-packages\dedupe\datamodel.py in check(self, record)
    119                 raise ValueError("Records do not line up with data model. "
    120                                  "The field '%s' is in data_model but not "
--> 121                                  "in a record" % field)
    122 
    123 

ValueError: Records do not line up with data model. The field 'firstname ' is in data_model but not in a record

Both have firstname.

Where am I going wrong?

I have tried transforming the dataframe and converting it to a dict in various ways. I cannot get it to work.

1 answer:

Answer 0: (score: 1)

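The problem is that the field names in your field definition have a trailing space, so they do not match the keys in your records.

You want

'firstname'

not

'firstname '

The same applies to 'lastname '.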
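One way to catch this kind of mismatch before calling dedupe is to compare the field names against the keys of a sample record. The following is only a minimal sketch in plain Python and does not call dedupe itself. The helper names missing_fields and strip_field_names are hypothetical, and the 'address1' value in the sample record is an assumed placeholder, since that key falls in the elided part of the record shown in the question.

```python
# Field definitions copied from the question (note the trailing spaces).
fields = [
    {'field': 'firstname ', 'type': 'String', 'has missing': True},
    {'field': 'lastname ', 'type': 'String', 'has missing': True},
    {'field': 'email', 'type': 'String', 'has missing': True},
    {'field': 'address1', 'type': 'String', 'has missing': True},
    {'field': 'mobile', 'type': 'String', 'has missing': True},
]

# Sample record from the question. 'address1' is an assumed placeholder:
# the question elides the remaining keys with "etc etc etc".
record = {
    'id': '76550',
    'title': 'mrs',
    'firstname': 'mary',
    'lastname': 'fakename',
    'email': 'fakemail@yahoo.com',
    'phone': None,
    'mobile': '353870748',
    'address1': None,
}


def missing_fields(fields, record):
    """Return the field names in the model that are not keys of the record."""
    return [f['field'] for f in fields if f['field'] not in record]


def strip_field_names(fields):
    """Return a copy of the field definitions with whitespace-trimmed names."""
    return [{**f, 'field': f['field'].strip()} for f in fields]


# repr() makes the stray trailing space visible in the output.
print([repr(name) for name in missing_fields(fields, record)])

cleaned = strip_field_names(fields)
print(missing_fields(cleaned, record))
```

Printing the names with repr() makes the trailing space easy to spot. Once the names are trimmed, the list of missing fields is empty, and the cleaned list is what would be passed on as the field definition.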