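I'm new to Python, so maybe I'm missing something basic here, but I can't figure it out... For my work, I'm trying to read a txt file and apply KNN to it.
The file content is shown below. It has three columns, the third column is the class, and the separator is a single space.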
0.85 17.45 2
0.75 15.6 2
3.3 15.45 2
5.25 14.2 2
4.9 15.65 2
5.35 15.85 2
5.1 17.9 2
4.6 18.25 2
4.05 18.75 2
3.4 19.7 2
2.9 21.15 2
3.1 21.85 2
3.9 21.85 2
4.4 20.05 2
7.2 14.5 2
7.65 16.5 2
7.1 18.65 2
7.05 19.9 2
5.85 20.55 2
5.5 21.8 2
6.55 21.8 2
6.05 22.3 2
5.2 23.4 2
4.55 23.9 2
5.1 24.4 2
8.1 26.35 2
10.15 27.7 2
9.75 25.5 2
9.2 21.1 2
11.2 22.8 2
12.6 23.1 2
13.25 23.5 2
11.65 26.85 2
12.45 27.55 2
13.3 27.85 2
13.7 27.75 2
14.15 26.9 2
14.05 26.55 2
15.15 24.2 2
15.2 24.75 2
12.2 20.9 2
12.15 21.45 2
12.75 22.05 2
13.15 21.85 2
13.75 22 2
13.95 22.7 2
14.4 22.65 2
14.2 22.15 2
14.1 21.75 2
14.05 21.4 2
17.2 24.8 2
17.7 24.85 2
17.55 25.2 2
17 26.85 2
16.55 27.1 2
19.15 25.35 2
18.8 24.7 2
21.4 25.85 2
15.8 21.35 2
16.6 21.15 2
17.45 20.75 2
18 20.95 2
18.25 20.2 2
18 22.3 2
18.6 22.25 2
19.2 21.95 2
19.45 22.1 2
20.1 21.6 2
20.1 20.9 2
19.9 20.35 2
19.45 19.05 2
19.25 18.7 2
21.3 22.3 2
22.9 23.65 2
23.15 24.1 2
24.25 22.85 2
22.05 20.25 2
20.95 18.25 2
21.65 17.25 2
21.55 16.7 2
21.6 16.3 2
21.5 15.5 2
22.4 16.5 2
22.25 18.1 2
23.15 19.05 2
23.5 19.8 2
23.75 20.2 2
25.15 19.8 2
25.5 19.45 2
23 18 2
23.95 17.75 2
25.9 17.55 2
27.65 15.65 2
23.1 14.6 2
23.5 15.2 2
24.05 14.9 2
24.5 14.7 2
14.15 17.35 1
14.3 16.8 1
14.3 15.75 1
14.75 15.1 1
15.35 15.5 1
15.95 16.45 1
16.5 17.05 1
17.35 17.05 1
17.15 16.3 1
16.65 16.1 1
16.5 15.15 1
16.25 14.95 1
16 14.25 1
15.9 13.2 1
15.15 12.05 1
15.2 11.7 1
17 15.65 1
16.9 15.35 1
17.35 15.45 1
17.15 15.1 1
17.3 14.9 1
17.7 15 1
17 14.6 1
16.85 14.3 1
16.6 14.05 1
17.1 14 1
17.45 14.15 1
17.8 14.2 1
17.6 13.85 1
17.2 13.5 1
17.25 13.15 1
17.1 12.75 1
16.95 12.35 1
16.5 12.2 1
16.25 12.5 1
16.05 11.9 1
16.65 10.9 1
16.7 11.4 1
16.95 11.25 1
17.3 11.2 1
18.05 11.9 1
18.6 12.5 1
18.9 12.05 1
18.7 11.25 1
17.95 10.9 1
18.4 10.05 1
17.45 10.4 1
17.6 10.15 1
17.7 9.85 1
17.3 9.7 1
16.95 9.7 1
16.75 9.65 1
19.8 9.95 1
19.1 9.55 1
17.5 8.3 1
17.55 8.1 1
17.85 7.55 1
18.2 8.35 1
19.3 9.1 1
19.4 8.85 1
19.05 8.85 1
18.9 8.5 1
18.6 7.85 1
18.7 7.65 1
19.35 8.2 1
19.95 8.3 1
20 8.9 1
20.3 8.9 1
20.55 8.8 1
18.35 6.95 1
18.65 6.9 1
19.3 7 1
19.1 6.85 1
19.15 6.65 1
21.2 8.8 1
21.4 8.8 1
21.1 8 1
20.4 7 1
20.5 6.35 1
20.1 6.05 1
20.45 5.15 1
20.95 5.55 1
20.95 6.2 1
20.9 6.6 1
21.05 7 1
21.85 8.5 1
21.9 8.2 1
22.3 7.7 1
21.85 6.65 1
21.3 5.05 1
22.6 6.7 1
22.5 6.15 1
23.65 7.2 1
24.1 7 1
21.95 4.8 1
22.15 5.05 1
22.45 5.3 1
22.45 4.9 1
22.7 5.5 1
23 5.6 1
23.2 5.3 1
23.45 5.95 1
23.75 5.95 1
24.45 6.15 1
24.6 6.45 1
25.2 6.55 1
26.05 6.4 1
25.3 5.75 1
24.35 5.35 1
23.3 4.9 1
22.95 4.75 1
22.4 4.55 1
22.8 4.1 1
22.9 4 1
23.25 3.85 1
23.45 3.6 1
23.55 4.2 1
23.8 3.65 1
23.8 4.75 1
24.2 4 1
24.55 4 1
24.7 3.85 1
24.7 4.3 1
24.9 4.75 1
26.4 5.7 1
27.15 5.95 1
27.3 5.45 1
27.5 5.45 1
27.55 5.1 1
26.85 4.95 1
26.6 4.9 1
26.85 4.4 1
26.2 4.4 1
26 4.25 1
25.15 4.1 1
25.6 3.9 1
25.85 3.6 1
24.95 3.35 1
25.1 3.25 1
25.45 3.15 1
26.85 2.95 1
27.15 3.15 1
27.2 3 1
27.95 3.25 1
27.95 3.5 1
28.8 4.05 1
28.8 4.7 1
28.75 5.45 1
28.6 5.75 1
29.25 6.3 1
30 6.55 1
30.6 3.4 1
30.05 3.45 1
29.75 3.45 1
29.2 4 1
29.45 4.05 1
29.05 4.55 1
29.4 4.85 1
29.5 4.7 1
29.9 4.45 1
30.75 4.45 1
30.4 4.05 1
30.8 3.95 1
31.05 3.95 1
30.9 5.2 1
30.65 5.85 1
30.7 6.15 1
31.5 6.25 1
31.65 6.55 1
32 7 1
32.5 7.95 1
33.35 7.45 1
32.6 6.95 1
32.65 6.6 1
32.55 6.35 1
32.35 6.1 1
32.55 5.8 1
32.2 5.05 1
32.35 4.25 1
32.9 4.15 1
32.7 4.6 1
32.75 4.85 1
34.1 4.6 1
34.1 5 1
33.6 5.25 1
33.35 5.65 1
33.75 5.95 1
33.4 6.2 1
34.45 5.8 1
34.65 5.65 1
34.65 6.25 1
35.25 6.25 1
34.35 6.8 1
34.1 7.15 1
34.45 7.3 1
34.7 7.2 1
34.85 7 1
34.35 7.75 1
34.55 7.85 1
35.05 8 1
35.5 8.05 1
35.8 7.1 1
36.6 6.7 1
36.75 7.25 1
36.5 7.4 1
35.95 7.9 1
36.1 8.1 1
36.15 8.4 1
37.6 7.35 1
37.9 7.65 1
29.15 4.4 1
34.9 9 1
35.3 9.4 1
35.9 9.35 1
36 9.65 1
35.75 10 1
36.7 9.15 1
36.6 9.8 1
36.9 9.75 1
37.25 10.15 1
36.4 10.15 1
36.3 10.7 1
36.75 10.85 1
38.15 9.7 1
38.4 9.45 1
38.35 10.5 1
37.7 10.8 1
37.45 11.15 1
37.35 11.4 1
37 11.75 1
36.8 12.2 1
37.15 12.55 1
37.25 12.15 1
37.65 11.95 1
37.95 11.85 1
38.6 11.75 1
38.5 12.2 1
38 12.95 1
37.3 13 1
37.5 13.4 1
37.85 14.5 1
38.3 14.6 1
38.05 14.45 1
38.35 14.35 1
38.5 14.25 1
39.3 14.2 1
39 13.2 1
38.95 12.9 1
39.2 12.35 1
39.5 11.8 1
39.55 12.3 1
39.75 12.75 1
40.2 12.8 1
40.4 12.05 1
40.45 12.5 1
40.55 13.15 1
40.45 14.5 1
40.2 14.8 1
40.65 14.9 1
40.6 15.25 1
41.3 15.3 1
40.95 15.7 1
41.25 16.8 1
40.95 17.05 1
40.7 16.45 1
40.45 16.3 1
39.9 16.2 1
39.65 16.2 1
39.25 15.5 1
38.85 15.5 1
38.3 16.5 1
38.75 16.85 1
39 16.6 1
38.25 17.35 1
39.5 16.95 1
39.9 17.05 1
My code:
    import csv
    import random
    import math
    import operator

    def loadDataset(filename, split, trainingSet=[] , testSet=[]):
        with open(filename, 'rb') as csvfile:
            lines = csv.reader(csvfile)
            dataset = list(lines)
            for x in range(len(dataset)-1):
                for y in range(3):
                    dataset[x][y] = float(dataset[x][y])
                if random.random() < split:
                    trainingSet.append(dataset[x])
                else:
                    testSet.append(dataset[x])

    def euclideanDistance(instance1, instance2, length):
        distance = 0
        for x in range(length):
            distance += pow((instance1[x] - instance2[x]), 2)
        return math.sqrt(distance)

    def getNeighbors(trainingSet, testInstance, k):
        distances = []
        length = len(testInstance)-1
        for x in range(len(trainingSet)):
            dist = euclideanDistance(testInstance, trainingSet[x], length)
            distances.append((trainingSet[x], dist))
        distances.sort(key=operator.itemgetter(1))
        neighbors = []
        for x in range(k):
            neighbors.append(distances[x][0])
        return neighbors

    def getResponse(neighbors):
        classVotes = {}
        for x in range(len(neighbors)):
            response = neighbors[x][-1]
            if response in classVotes:
                classVotes[response] += 1
            else:
                classVotes[response] = 1
        sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
        return sortedVotes[0][0]

    def getAccuracy(testSet, predictions):
        correct = 0
        for x in range(len(testSet)):
            if testSet[x][-1] == predictions[x]:
                correct += 1
        return (correct/float(len(testSet))) * 100.0

    def main():
        # prepare data
        trainingSet=[]
        testSet=[]
        split = 0.67
        loadDataset('Jain.txt', split, trainingSet, testSet)
        print 'Train set: ' + repr(len(trainingSet))
        print 'Test set: ' + repr(len(testSet))
        # generate predictions
        predictions=[]
        k = 3
        for x in range(len(testSet)):
            neighbors = getNeighbors(trainingSet, testSet[x], k)
            result = getResponse(neighbors)
            predictions.append(result)
            print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
        accuracy = getAccuracy(testSet, predictions)
        print('Accuracy: ' + repr(accuracy) + '%')

    main()
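Answer 0 (score: 0)
Here:
    lines = csv.reader(csvfile)
you have to tell csv.reader what separator to use; otherwise it will use the default Excel ',' delimiter. Note that in the sample you posted, the separator may actually not be "a space" but a tab ("\t" in Python) or a random number of spaces, in which case it is not a CSV-like format and you will have to parse the lines yourself.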
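For the "parse the lines yourself" case, calling str.split() with no argument splits on any run of whitespace, so it handles single spaces, tabs, and repeated spaces alike. A minimal sketch, assuming every non-blank line holds only numeric fields; parse_line and load_rows are hypothetical helper names, not part of the code in the question:

```python
def parse_line(line):
    """Split one line on any run of whitespace and convert each field to float."""
    return tuple(float(field) for field in line.split())


def load_rows(filename):
    """Read a whitespace-separated text file and return a list of float tuples."""
    rows = []
    with open(filename) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            rows.append(parse_line(line))
    return rows
```

If the separator really is exactly one space on every line, passing delimiter=" " to csv.reader should work as well.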
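Your code is also far from Pythonic. First things first: Python's "for" loop is really a "for each" loop, i.e. it directly yields the values of the object you iterate over. The proper way to iterate over a list is: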
    lst = ["a", "b", "c"]
    for item in lst:
        print(item)
So there is no need for range() and indexed access here. Note that if you want the index as well, you can use enumerate(sequence), which yields (index, item) pairs, i.e.:
    lst = ["a", "b", "c"]
    for index, item in enumerate(lst):
        print("item at {} is {}".format(index, item))
So your loadDataset() function can be rewritten as:
    def loadDataset(filename, split, trainingSet=None , testSet=None):
        # fix the mutable default argument gotcha
        # cf https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
        if trainingSet is None:
            trainingSet = []
        if testSet is None:
            testSet = []
        # 'rb' is what the Python 2 csv module expects;
        # on Python 3, open the file in text mode with newline='' instead
        with open(filename, 'rb') as csvfile:
            # assumes tab-separated values; use delimiter=" "
            # if the file really uses a single space
            reader = csv.reader(csvfile, delimiter="\t")
            for row in reader:
                row = tuple(float(x) for x in row)
                if random.random() < split:
                    trainingSet.append(row)
                else:
                    testSet.append(row)
        # so the caller can get the values back
        return trainingSet, testSet
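Note that you will still get a ValueError at row = tuple(float(x) for x in row) if any value in the file is not a proper representation of a float. The solution here is to catch the error and handle it one way or another: re-raise it with additional debugging info (the wrong value and the line of the file it belongs to), log the error and ignore the row, or do whatever makes sense in the context of your app/lib: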
    for row in reader:
        try:
            row = tuple(float(x) for x in row)
        except ValueError as e:
            # here we choose to just log the error
            # and ignore the row, but you may want
            # to do otherwise, your choice...
            print("wrong value in line {}: {}".format(reader.line_num, row))
            continue
        if random.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)
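The other option mentioned above, re-raising with additional debugging info, could look roughly like this. It is only a sketch: parse_row is a hypothetical helper name, and the message format is an arbitrary choice.

```python
def parse_row(row, line_num):
    """Convert a row of strings to floats, adding context if a value is invalid."""
    try:
        return tuple(float(x) for x in row)
    except ValueError as e:
        # Re-raise with the line number and the offending row attached,
        # so the caller can see where the bad value came from.
        raise ValueError(
            "wrong value in line {}: {!r} ({})".format(line_num, row, e)
        )
```

Inside the loop you would then call row = parse_row(row, reader.line_num) and let the exception propagate to the caller.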
Also, if you want to iterate over two lists in parallel (getting "list1[x], list2[x]" pairs), you can use zip():
    lst1 = ["a", "b", "c"]
    lst2 = ["x", "y", "z"]
    for pair in zip(lst1, lst2):
        print(pair)
And there are functions that compute a value from an iterable, such as sum(), i.e.:
    lst = [1, 2, 3]
    print(sum(lst))
So your euclideanDistance function can be rewritten as:
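    def euclideanDistance(instance1, instance2, length):
        # only the first `length` values are features; the rest is the class label
        pairs = zip(instance1[:length], instance2[:length])
        # pow() needs the exponent: sum the squared differences
        return math.sqrt(sum(pow(x - y, 2) for x, y in pairs))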
And so on...
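As one possible continuation in the same spirit, getResponse and getAccuracy from the question could be written without index loops. This is only a sketch, assuming (as in the question) that the class label is the last element of each row; using collections.Counter for the vote is a suggestion here, not something the question's code does.

```python
from collections import Counter


def getResponse(neighbors):
    """Return the most common class label (last element) among the neighbors."""
    votes = Counter(neighbor[-1] for neighbor in neighbors)
    return votes.most_common(1)[0][0]


def getAccuracy(testSet, predictions):
    """Return the percentage of test rows whose label matches the prediction."""
    correct = sum(
        1 for row, predicted in zip(testSet, predictions) if row[-1] == predicted
    )
    return correct / float(len(testSet)) * 100.0
```

Counter also avoids dict.iteritems(), which only exists on Python 2.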