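Hi, I'm new to Python and I can't figure out how to resolve the following error.
I have a data frame of store data with about 2 million records and 20 columns. I group the stores by state and, after training on each state, try to run dedupe_dataframe on each state.
This is what my code looks like (np is numpy, pd is pandas, dp is pandas_dedupe):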
import numpy as np
import pandas as pd
import pandas_dedupe as dp

# Read the store data, keeping every column as a string
stores = pd.read_csv("storefile.csv", sep=",", encoding="latin1", dtype=str)
# There was a \t in the first column, so remove tab characters
stores = stores.replace('\t', '', regex=True)
# Replace missing values with empty strings
stores = stores.replace(np.nan, '')
# Get a list of lowercase state codes
states = list(stores.State.str.lower().unique())
# Group the data by state
state_data = {state: stores[stores.State.str.lower() == state] for state in states}
# Run de-dupe for the state of Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName', 'Address', 'City', 'State'])
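The per-state split described above can also be sketched with `groupby`, which avoids filtering the full frame once per state. This is only a minimal sketch on invented toy rows (the store names and addresses are made up, not taken from the real file), and it does not call pandas_dedupe. Resetting each group's index is a choice made here for tidiness; it is not established that it has any effect on the error reported below.

```python
import pandas as pd

# Toy stand-in for the store file; every value here is invented for illustration.
stores = pd.DataFrame({
    "StoreBannerName": ["Acme", "ACME", "Best Mart", "Corner Shop"],
    "Address": ["1 Main St", "1 Main Street", "9 Oak Ave", "5 Elm Rd"],
    "City": ["Columbus", "Columbus", "Austin", "Dayton"],
    "State": ["OH", "oh", "TX", "Oh"],
})

# Group on the lowercased state code so 'OH', 'oh' and 'Oh' land together.
state_key = stores["State"].str.lower()
state_data = {
    state: group.reset_index(drop=True)  # number each state's rows from 0
    for state, group in stores.groupby(state_key)
}

print(sorted(state_data))      # state codes found in the toy data
print(len(state_data["oh"]))   # number of Ohio rows
```

Each value in `state_data` is an ordinary DataFrame, so it can be passed on to whatever per-state step comes next.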
I get the following error:
importing data ...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-37-e2ed10256338> in <module>
----> 1 dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in
dedupe_dataframe(df, field_properties, canonicalize, config_name,
recall_weight, sample_size)
211 # train or load the model
212 deduper = _train(settings_file, training_file, data_d, field_properties,
--> 213 sample_size)
214
215 # ## Set threshold
~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in
_train(settings_file, training_file, data, field_properties, sample_size)
58 # To train dedupe, we feed it a sample of records.
59 sample_num = math.floor(len(data) * sample_size)
---> 60 deduper.sample(data, sample_num)
61
62 # If we have training data saved from a previous run of dedupe,
~\anaconda3\lib\site-packages\dedupe\api.py in sample(self, data,
sample_size, blocked_proportion, original_length)
836 sample_size,
837 original_length,
--> 838 index_include=examples)
839
840 self.active_learner.mark(examples, y)
~\anaconda3\lib\site-packages\dedupe\labeler.py in __init__(self,
data_model, data, blocked_proportion, sample_size, original_length,
index_include)
401 data = core.index(data)
402
--> 403 self.candidates = super().sample(data, blocked_proportion, sample_size)
404
405 random_pair = random.choice(self.candidates)
~\anaconda3\lib\site-packages\dedupe\labeler.py in sample(self, data,
blocked_proportion, sample_size)
50 return [(data[k1], data[k2])
51 for k1, k2
---> 52 in blocked_sample_keys | random_sample_keys]
53
54
~\anaconda3\lib\site-packages\dedupe\labeler.py in <listcomp>(.0)
49
50 return [(data[k1], data[k2])
---> 51 for k1, k2
52 in blocked_sample_keys | random_sample_keys]
53
KeyError: 2147550487
Answer 0 (score: 2)
Replace the following lines:
#Running De-Dupe for state Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
with
#Running De-Dupe for state Ohio ('oh')
state_data['oh'].drop_duplicates(subset=['StoreBannerName','Address','City','State'], keep='first')
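The `subset` and `keep='first'` arguments in this answer are the ones accepted by pandas' `DataFrame.drop_duplicates`. Below is a minimal sketch of what that method does, assuming it is the one intended and using invented rows rather than the asker's data. It removes rows that are identical on the listed columns and leaves near-matches in place, so it is not a fuzzy-matching substitute for pandas_dedupe.

```python
import pandas as pd

# Invented rows for illustration only.
ohio = pd.DataFrame({
    "StoreBannerName": ["Acme", "Acme", "Acme", "Corner Shop"],
    "Address": ["1 Main St", "1 Main St", "1 Main Street", "5 Elm Rd"],
    "City": ["Columbus", "Columbus", "Columbus", "Dayton"],
    "State": ["OH", "OH", "OH", "OH"],
})

key_columns = ["StoreBannerName", "Address", "City", "State"]

# keep='first' retains the first row of each group of rows with identical keys.
deduped = ohio.drop_duplicates(subset=key_columns, keep="first")

print(len(ohio), "->", len(deduped))   # 4 -> 3
print(deduped["Address"].tolist())     # ['1 Main St', '1 Main Street', '5 Elm Rd']
```

The exact duplicate of the first row is dropped, while "1 Main St" and "1 Main Street" both survive because they differ as strings.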