I am building a machine learning model with the Python Recordlinkage library, where the model is trained on pre-matched data.
Here is the code snippet:
urltrain = "../Training_Set.data"
namestrain = ['TrueMatchID','System','ID','Col1','Col2']
golden_pair = ps.read_csv(urltrain, names=namestrain)
golden_pair = np.asarray(golden_pair).reshape(5000,5)
golden_pair = ps.DataFrame(golden_pair)
indexer = rl.BlockIndex(on='TrueMatchID')
golden_pair_index = indexer.index(golden_pair)
print(indexer)
# Initialize the classifier
logreg = rl.LogisticRegressionClassifier()
# Train the classifier
logreg.learn(golden_pair.all(), golden_pair_index)
The error I get is:
KeyError: "['TrueMatchID'] not in index"
Sample data:
TrueMatchID System ID Col1 Col2
12345 2 736 1111.1 1111
12345 1 736 1111.4 1111
54321 1 739 2222.3 2222
54321 2 740 2222 2222.4
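What is wrong with the code? I am fairly new to Python, so I am not sure whether I am passing some wrong arguments.
Answer 0 (score: -1)
Here are comments on the code you wrote: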
golden_pair = ps.read_csv(urltrain, names=namestrain) # pandas DataFrame with the column names intact
golden_pair = np.asarray(golden_pair).reshape(5000,5) # converting to a NumPy array drops pandas metadata such as the column names
golden_pair = ps.DataFrame(golden_pair) # the column names must be supplied again here, since the NumPy array no longer carries them
Change the last line to:
golden_pair = ps.DataFrame(golden_pair, columns=namestrain)
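To see why the KeyError appears, here is a minimal sketch that reproduces the loss of column names using only pandas and NumPy. The four rows are taken from the sample data in the question. The variable names `original`, `as_array`, `unlabeled` and `relabeled` are made up for illustration, and the `ps` alias for pandas follows the question's code.

```python
import numpy as np
import pandas as ps  # same alias as in the question

namestrain = ['TrueMatchID', 'System', 'ID', 'Col1', 'Col2']

# Rows copied from the sample data in the question.
rows = [
    [12345, 2, 736, 1111.1, 1111.0],
    [12345, 1, 736, 1111.4, 1111.0],
    [54321, 1, 739, 2222.3, 2222.0],
    [54321, 2, 740, 2222.0, 2222.4],
]
original = ps.DataFrame(rows, columns=namestrain)

# A NumPy array holds only the values, not the column labels.
as_array = np.asarray(original)

# Without columns=..., pandas falls back to integer labels 0..4.
unlabeled = ps.DataFrame(as_array)

# Passing the names again restores the labels.
relabeled = ps.DataFrame(as_array, columns=namestrain)

print(list(unlabeled.columns))
print('TrueMatchID' in unlabeled.columns)
print('TrueMatchID' in relabeled.columns)
```

Any lookup by name on `unlabeled`, such as blocking on 'TrueMatchID', fails because that label no longer exists, while the same lookup on `relabeled` succeeds.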
The rest of the code can stay as it is:
indexer = rl.BlockIndex(on='TrueMatchID')
golden_pair_index = indexer.index(golden_pair)
print(indexer)
# Initialize the classifier
logreg = rl.LogisticRegressionClassifier()
# Train the classifier
logreg.learn(golden_pair.all(), golden_pair_index)
P.S. I don't understand why you need to reshape the data and then cast it back to a DataFrame. Maybe you can avoid that.
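As a rough illustration of skipping the round trip, the sketch below keeps the DataFrame returned by `read_csv` and never converts it to a NumPy array. It assumes the file is comma-separated with no header row, which the question does not confirm. An in-memory `io.StringIO` buffer stands in for "../Training_Set.data", and the name `sample` is hypothetical.

```python
import io
import pandas as ps  # same alias as in the question

namestrain = ['TrueMatchID', 'System', 'ID', 'Col1', 'Col2']

# Stand-in for the training file; assumed comma-separated, no header row.
sample = io.StringIO(
    "12345,2,736,1111.1,1111\n"
    "12345,1,736,1111.4,1111\n"
    "54321,1,739,2222.3,2222\n"
    "54321,2,740,2222,2222.4\n"
)

# read_csv already returns a DataFrame with the requested column names,
# so no reshape or re-wrapping is needed.
golden_pair = ps.read_csv(sample, names=namestrain)

print(golden_pair.columns.tolist())
print(golden_pair.shape)
```

With the labels preserved from the start, 'TrueMatchID' remains available for the blocking step.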