KNN algorithm from scratch in Python

Time: 2017-05-02 10:33:03

Tags: python algorithm pandas machine-learning knn

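I am trying to implement the KNN algorithm from scratch, but I am getting a very strange error that says "KeyError: 0".

I think this means I have an empty dictionary somewhere, but I don't see how that could be. To be clear, I should add that the data works fine with a black-box KNN algorithm, so the problem must be something in my code...

Here is my code: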

import numpy as np
import pandas as pd
import csv
import scipy.stats as stats
import math
from collections import Counter
import operator
from operator import itemgetter


"""Training features dataset"""
filenametrain_data = 'training_data.csv'
training_feature_set = pd.read_csv(filenametrain_data, header=None, usecols=range(1,13627))

"""Training labels dataset"""
filenametrain_label = 'training_labels.csv'
training_feature_label = pd.read_csv(filenametrain_label, header=None, usecols=[1], names=['Category'])

"""Split into training and testing datasets 90%/10%"""
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(training_feature_set, training_feature_label, test_size = 0.1, random_state=42)


"""KNN Model"""
def distance(X_train, y_train):
    dist = 0.0
    for i in range(len(X_train)):
        dist += pow((X_train[i] - y_train[i]), 2)
    return math.sqrt(dist)

def getNeighbors(X_train, y_train, X_test, k):
    distances = []
    for i in range(len(X_train)):
        dist = distance(X_test, X_train[i])
        distances.append((X_train[i], dist, y_train[i]))
    distances.sort(key=operator.itemgetter(1))
    neighbor = []
    for elem in range(k):
        neighbor.append((distances[elem][0], distances[elem][2]))
    return neighbor

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = int(neighbors[x][-1])
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse = True)
    return sortedVotes[0][0]

"""Prediction"""    
predictions = []
k = 4
for x in range(len(X_test)):
    neighbors = getNeighbors(X_train, y_train, y_test[x], k)
    result = getResponse(neighbors)
    predictions.append(result)   

The error returned is:

  

Traceback (most recent call last):
  File "", line 2, in
    neighbors = getNeighbors(X_train, y_train, y_test[x], k)
  File "C:\ANACONDA\lib\site-packages\pandas\core\frame.py", line 1797, in __getitem__
    return self._getitem_column(key)
  File "C:\ANACONDA\lib\site-packages\pandas\core\frame.py", line 1804, in _getitem_column
    return self._get_item_cache(key)
  File "C:\ANACONDA\lib\site-packages\pandas\core\generic.py", line 1084, in _get_item_cache
    values = self._data.get(item)
  File "C:\ANACONDA\lib\site-packages\pandas\core\internals.py", line 2851, in get
    loc = self.items.get_loc(item)
  File "C:\ANACONDA\lib\site-packages\pandas\core\index.py", line 1572, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas\index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas\index.c:3824)
  File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)
  File "pandas\hashtable.pyx", line 686, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12280)
  File "pandas\hashtable.pyx", line 694, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12231)
KeyError: 0
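
A note on reading this traceback: the failing frames are DataFrame.__getitem__ and _getitem_column, which suggests that y_test[x] is being treated as a column lookup by label and not as a row lookup by position. Since y_test has a single column named 'Category', the label 0 would not be found. The sketch below reproduces that behaviour on a small made-up frame (the data and index values are assumptions for illustration only) and shows positional access with .iloc. It also seems likely that a row of X_test, not y_test, was meant to be passed as the third argument, but that is a guess from the function signature.

```python
import pandas as pd

# Hypothetical stand-in for y_test: one 'Category' column and a shuffled
# row index, similar to what a random train/test split leaves behind.
y_test = pd.DataFrame({"Category": [3, 1, 2]}, index=[17, 4, 42])

# df[key] looks up a COLUMN label, so the integer 0 is not found.
try:
    y_test[0]
    raised = False
except KeyError:
    raised = True

# Positional row access goes through .iloc instead.
first_label = y_test.iloc[0]["Category"]

# Converting to a NumPy array also gives plain 0-based indexing.
labels = y_test["Category"].values
```

If this reading is right, converting the frames to arrays (for example with .values) before the loops, or indexing rows with .iloc, would avoid the label lookup.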

The dataset can be accessed here.

1 answer:

Answer 0 (score: 0)

Edit: You may have an extra character at the beginning of your csv file. Try specifying the encoding in your read_csv() call. See "encoding" at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

  

encoding : str, default None. Encoding to use for UTF when reading/writing (e.g. 'utf-8'). List of Python standard encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
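
As a minimal sketch of passing an encoding to read_csv(), the example below writes a tiny made-up CSV that begins with a UTF-8 byte order mark (one common kind of extra leading character) and reads it back with encoding='utf-8-sig'. The file contents and the choice of 'utf-8-sig' are assumptions for illustration; the right encoding for the actual training files is not known from the question.

```python
import os
import tempfile

import pandas as pd

# A tiny CSV that starts with a UTF-8 byte order mark (BOM),
# standing in for a file with an extra leading character.
raw = b"\xef\xbb\xbf1,2\n3,4\n"
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    f.write(raw)
    path = f.name

try:
    # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
    df = pd.read_csv(path, header=None, encoding="utf-8-sig")
finally:
    os.remove(path)
```

With header=None the columns are labelled 0 and 1, and the first cell parses as the integer 1 with no stray character attached.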

You are using dot notation where you don't need it (in two places that I can see right away):

operator.itemgetter(1)

You have already imported itemgetter specifically:

from operator import itemgetter

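So when you call itemgetter, just call it without the dot notation. Because the code also has import operator, both spellings work, so this is a tidy-up and not by itself the cause of the KeyError: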

itemgetter(1)
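
To show the bare itemgetter(1) call in context, here is a minimal sketch of a from-scratch k-NN that works on NumPy arrays instead of DataFrames. The function names, the (label, distance) tuple layout, and the toy data are all made up for illustration; this is not the asker's code, only one way the same steps could be written.

```python
from collections import Counter
from operator import itemgetter

import numpy as np


def euclidean(a, b):
    """Euclidean distance between two 1-D feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def get_neighbors(X_train, y_train, test_row, k):
    """Return the labels of the k training rows closest to test_row."""
    distances = [
        (y_train[i], euclidean(test_row, X_train[i]))
        for i in range(len(X_train))
    ]
    # Element 1 of each (label, distance) tuple is the distance.
    distances.sort(key=itemgetter(1))
    return [label for label, _ in distances[:k]]


def predict(X_train, y_train, test_row, k):
    """Majority vote among the k nearest labels."""
    votes = Counter(get_neighbors(X_train, y_train, test_row, k))
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated clusters labelled 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.05, 0.05], [5.0, 5.1]])

predictions = [int(predict(X_train, y_train, row, k=3)) for row in X_test]
```

On this toy data the two test points are assigned to classes 0 and 1 respectively. With pandas inputs, one option would be to pass X_train.values and y_train['Category'].values so that integer indexing is positional.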