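I am trying to implement the k-nearest neighbors algorithm on a dataset that I have preprocessed. I load the data as a pandas DataFrame and then convert it to a numpy array, but I get the following error: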
File "/home/user/Documents/Mooc_implementation.py", line 8, in <module>
x = num_data[:,:10]
File "/usr/lib/python2.7/dist-packages/numpy/core/records.py", line 499, in __getitem__
obj = super(recarray, self).__getitem__(indx)
IndexError: too many indices for array
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('/home/user/Documents/MOOC dataset cleaned/student_reg_vle_info_assessment.csv')
num_data = dataset.to_records(index=False)
x = num_data[:,:10]
y = num_data[:,10:11]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=4)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
What should I do?
Output of dataset.head():
   date_submitted  date_registration  date_unregistration  sum_click  \
0              18               -159                  445         16
1              22                -53                  445          4
2              30                -92                   12          3
3              17                -52                  445          1
4              26               -176                  445          5

   num_of_prev_attempts  age_band  region  highest_education  studied_credits  \
0                     0         0       0                  0              240
1                     0         1       1                  0               60
2                     0         1       2                  1               60
3                     0         1       3                  1               60
4                     0         2       4                  2               60

   score  final_result
0     78             0
1     70             0
2     87             2
3     72             0
4     69             0
[Finished in 0.274s]
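Answer 0 (score: 0)
Why are you calling dataset.to_records(index=False) in your case? It converts the DataFrame into a one-dimensional numpy record array (one record per row), so it cannot be indexed like num_data[:,:10]. There is also no need to call dataset.to_records(index=False) before passing the data to train_test_split; you can slice the DataFrame directly with .iloc: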
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
num_data = pd.read_csv('/home/user/Documents/MOOC dataset cleaned/student_reg_vle_info_assessment.csv')
# num_data = dataset.to_records(index=False)
x = num_data.iloc[:,:10]
y = num_data.iloc[:,10:11]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=4)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
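To see why the original indexing fails, here is a minimal sketch. It assumes a tiny hand-made DataFrame (hypothetical values, only three of the columns from the question) in place of the real CSV. The point is only that to_records gives back a 1-D record array, so a two-part index such as [:, :2] raises the same IndexError as in the traceback.

```python
import pandas as pd

# Hypothetical stand-in for the CSV in the question: 3 rows, 3 integer columns.
df = pd.DataFrame({
    "sum_click": [16, 4, 3],
    "score": [78, 70, 87],
    "final_result": [0, 0, 2],
})

records = df.to_records(index=False)
rec_ndim = records.ndim      # 1: each element is a whole row (a record)
rec_shape = records.shape    # (3,): three records, no column axis

try:
    records[:, :2]           # two indices on a 1-D array
    failed = False
except IndexError:
    failed = True            # same error class as in the question's traceback

# Fields of a record array are selected by name, not by column position.
scores = records["score"]

print(rec_ndim, rec_shape, failed, scores.tolist())
```

So a record array is addressed as records["score"] or records[0], never as records[:, :10].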
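Answer 1 (score: 0)
Q: Why does this error occur?
A: If you have a pandas DataFrame and need to index it by position, you need to use the .iloc indexer. The indexing used in the question, num_data[:,:10], is the syntax for a two-dimensional numpy array; it does not work on the one-dimensional record array returned by to_records.
Use this: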
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('/home/user/Documents/MOOC dataset cleaned/student_reg_vle_info_assessment.csv')
x = dataset.iloc[:,:10]
y = dataset.iloc[:,10:11]
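If you would still rather work with a plain numpy array, the following sketch may help. It again assumes a small hypothetical DataFrame rather than the real CSV, and uses DataFrame.to_numpy() (available in pandas 0.24 and later; on older versions .values plays the same role). It also shows the difference between slicing the label column as 2:3 and selecting it as 2: the first keeps a one-column 2-D shape, the second gives a 1-D result, which is the shape scikit-learn classifiers generally expect for y.

```python
import pandas as pd

# Hypothetical four-row frame; the last column plays the role of final_result.
df = pd.DataFrame({
    "sum_click": [16, 4, 3, 1],
    "score": [78, 70, 87, 72],
    "final_result": [0, 0, 2, 0],
})

arr = df.to_numpy()          # plain 2-D ndarray, shape (4, 3)
x = arr[:, :2]               # feature columns, shape (4, 2)
y = arr[:, 2]                # label column as a 1-D array, shape (4,)

y_frame = df.iloc[:, 2:3]    # slice of positions: one-column DataFrame, shape (4, 1)
y_series = df.iloc[:, 2]     # single position: Series, shape (4,)

print(arr.shape, x.shape, y.shape, y_frame.shape, y_series.shape)
```

With the dataset from the question, the equivalent would presumably be dataset.iloc[:, 10] (or arr[:, 10]) for the labels, assuming final_result is the eleventh column as the head() output suggests.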