Python RecordLinkage - supervised machine learning error

Time: 2018-02-22 07:04:12

Tags: python record-linkage

I am using the Python recordlinkage library to build a machine learning model that will be trained on pre-matched data.

Here is the code snippet:

urltrain = "../Training_Set.data"
namestrain = ['TrueMatchID','System','ID','Col1','Col2']

golden_pair = ps.read_csv(urltrain, names=namestrain)

golden_pair = np.asarray(golden_pair).reshape(5000,5)

golden_pair = ps.DataFrame(golden_pair)

indexer = rl.BlockIndex(on='TrueMatchID')
golden_pair_index = indexer.index(golden_pair)

print(indexer)

# Initialize the classifier
logreg = rl.LogisticRegressionClassifier()
# Train the classifier
logreg.learn(golden_pair.all(), golden_pair_index)

The error I get is:

KeyError: "['TrueMatchID'] not in index"

Sample data:

TrueMatchID   System     ID  Col1    Col2
12345       2            736     1111.1  1111
12345       1            736     1111.4  1111
54321       1            739     2222.3  2222
54321       2            740     2222    2222.4

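What is going wrong in the code? I am fairly new to Python, so I am not sure whether I am passing some wrong arguments.

1 Answer:

Answer 0: (score: -1)

Here is the code you wrote, with comments: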

golden_pair = ps.read_csv(urltrain, names=namestrain) # pandas DataFrame with the column names intact

golden_pair = np.asarray(golden_pair).reshape(5000,5) # converting to a NumPy array drops pandas metadata such as the column names

golden_pair = ps.DataFrame(golden_pair) # the column names must be supplied again here, because the NumPy array no longer carries them

Change the last line to:

golden_pair = ps.DataFrame(golden_pair, columns=namestrain)
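To see why the column lookup fails, here is a minimal sketch that reproduces the round trip on a small frame. It assumes pandas and NumPy are installed; the four rows are copied from the sample data in the question, and the variable names original, as_array, unnamed and restored are made up for illustration.

```python
import numpy as np
import pandas as pd

namestrain = ['TrueMatchID', 'System', 'ID', 'Col1', 'Col2']

# Four rows copied from the sample data in the question.
rows = [
    [12345, 2, 736, 1111.1, 1111.0],
    [12345, 1, 736, 1111.4, 1111.0],
    [54321, 1, 739, 2222.3, 2222.0],
    [54321, 2, 740, 2222.0, 2222.4],
]
original = pd.DataFrame(rows, columns=namestrain)

# A NumPy array holds only the values, so the column labels are dropped.
as_array = np.asarray(original)

# Without columns=, pandas falls back to integer labels.
unnamed = pd.DataFrame(as_array)
print(list(unnamed.columns))              # [0, 1, 2, 3, 4]
print('TrueMatchID' in unnamed.columns)   # False

# Passing columns= restores the labels, so 'TrueMatchID' can be found again.
restored = pd.DataFrame(as_array, columns=namestrain)
print(list(restored.columns))
```

Any code that later looks up 'TrueMatchID' by name, such as the blocking step, can only succeed on the restored frame.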

You can keep using the rest of the code as is:

indexer = rl.BlockIndex(on='TrueMatchID')
golden_pair_index = indexer.index(golden_pair)

print(indexer)

# Initialize the classifier
logreg = rl.LogisticRegressionClassifier()
# Train the classifier
logreg.learn(golden_pair.all(), golden_pair_index)

P.S. I don't understand why you need to reshape the data and then cast it back to a DataFrame. You can probably avoid that.
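As a sketch of what skipping the round trip could look like: read_csv with names= already returns a DataFrame of the right shape, so the column labels never get lost. The example assumes a comma-separated file with no header row (the real layout of Training_Set.data is not shown in the question) and uses an in-memory string, csv_text, as a stand-in for the file.

```python
import io

import pandas as pd

namestrain = ['TrueMatchID', 'System', 'ID', 'Col1', 'Col2']

# Stand-in for Training_Set.data; assumed to be comma-separated without a header.
csv_text = (
    "12345,2,736,1111.1,1111\n"
    "12345,1,736,1111.4,1111\n"
    "54321,1,739,2222.3,2222\n"
    "54321,2,740,2222,2222.4\n"
)

# read_csv already yields one row per record and one column per name,
# so no asarray/reshape/DataFrame round trip is needed.
golden_pair = pd.read_csv(io.StringIO(csv_text), names=namestrain)

print(golden_pair.shape)
print(golden_pair['TrueMatchID'].tolist())
```

With the labels kept from the start, the frame can be passed on to the indexing step directly.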