Binarizing a dataset column with sklearn gives incorrect results. Where is the bug in my code?

Asked: 2018-07-03 06:56:48

Tags: python machine-learning scikit-learn data-science

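I am preprocessing a dataset and have binarized one of its columns. After binarization, I think the values are incorrect. The data has 303 observations (rows) and 14 features (columns), and the column I want to binarize is the last one.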

Here is the relevant part of my code:

    import pandas as pd
    import numpy as np

    #importing the dataset
    header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
    dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_data.csv', names= header_names)


    array = dataset.values

    # binarize num
    from sklearn.preprocessing import Binarizer
    x = array[:,13:]
    binarize = Binarizer(threshold=0.0).fit(x)
    transform_binarize = binarize.transform(x)

    array[:,13:]=transform_binarize
    print(transform_binarize)

The original data column looks like this:

     0,2,1,0,0.........1,0,3,1,1,2

This is the output of the code above:

         [[0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]]

I think the last value is incorrect, and I don't understand why this happens.

1 Answer:

Answer 0 (score: -1)

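If I am right in assuming that this is the heart disease dataset taken from this UCI repository, and that the CSV file is this one, then these are the correct values for the binarizer. The last row of the original data column you are using contains a 0, which I think you missed. Try the following code: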

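    # Print each row index, the original value, and its binarized value
    # (Python 3 print syntax)
    for idx in range(len(x)):
        print(idx, x[idx], transform_binarize[idx])

Output (last rows shown; this was run under Python 2, hence the `L` suffix on the integers):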

    278 [1L] [1.]
    279 [0L] [0.]
    280 [2L] [1.]
    281 [0L] [0.]
    282 [3L] [1.]
    283 [0L] [0.]
    284 [2L] [1.]
    285 [4L] [1.]
    286 [2L] [1.]
    287 [0L] [0.]
    288 [0L] [0.]
    289 [0L] [0.]
    290 [1L] [1.]
    291 [0L] [0.]
    292 [2L] [1.]
    293 [2L] [1.]
    294 [1L] [1.]
    295 [0L] [0.]
    296 [3L] [1.]
    297 [1L] [1.]
    298 [1L] [1.]
    299 [2L] [1.]
    300 [3L] [1.]
    301 [1L] [1.]
    302 [0L] [0.]     #<--- I think you missed this row while reading your dataset

If you run this code, you will see that the binarizer is working correctly.
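To make the thresholding rule concrete, here is a minimal sketch on a few made-up values. The `sample` array is illustrative only and is not taken from the real dataset; it just uses integers in the 0 to 4 range, like the `num` column. `Binarizer(threshold=0.0)` maps every value strictly greater than 0 to 1 and everything else to 0, so a trailing 0 stays 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Illustrative values only (hypothetical, not the real dataset);
# they mimic a `num` column holding integers from 0 to 4 and ending in 0.
sample = np.array([[0.0], [2.0], [1.0], [0.0], [3.0], [4.0], [0.0]])

# Same settings as in the question: threshold of 0.0
binarized = Binarizer(threshold=0.0).fit_transform(sample)

# Values strictly greater than the threshold become 1, all others become 0,
# so the result should agree with a plain NumPy comparison.
expected = (sample > 0.0).astype(float)

# Print each original value next to its binarized value
for original, result in zip(sample.ravel(), binarized.ravel()):
    print(original, "->", result)

print("matches NumPy comparison:", np.array_equal(binarized, expected))
```

Comparing against `(sample > 0.0)` is a quick sanity check you could also run on the real column, assuming it has been loaded as a numeric array.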