This is the code I am using. I am trying to use RandomForestClassifier to classify activities based on the learner and the dominant subject.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_multilabel_classification
Data = pd.read_excel("F:\VIT material\Master thesis\DATASET.xlsx",names=['learner','Dominant_Subject','Activity'])
print(Data)
print(Data.columns)
Data.reshape(Data.columns.values)
print(Data)
number= LabelEncoder()
Data['learner']= number.fit_transform(Data['learner'].astype('str'))
Data['Dominant_Subject']=number.fit_transform(Data['Dominant_Subject'].astype('str'))
Data['Activity']= number.fit_transform(Data['Activity'].astype('str'))
print(Data)
print(Data.shape)
X = Data['learner']
print(X)
print(X.shape)
Y = Data['Dominant_Subject']
print(Y)
print(Y.shape)
print(len(X))
print(len(Y))
X_train = X[:-5]
X_test = X[-5:]
Y_train = Y[:-10]
Y_test = Y[-10:]
X_train, X_test, Y_train, Y_test=train_test_split(X,test_size=0.2,random_state=20)
print(X_train,X_test,Y_train,Y_test)
model = linear_model.LinearRegression()
model.fit(X_train,Y_train)
print(model.fit())
clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
clf.fit(X_train,Y_train)
print(clf.fit())
predicted = clf.predict(X)
print(accuracy_score(predicted,Y))
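The number of samples and the number of labels are equal, but I still get an error saying that the number of labels does not match the number of samples.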
Traceback (most recent call last):
  File "C:/Users/RAJIV MISHRA/PycharmProjects/mltutorialpractice/13.py", line 38, in <module>
    clf.fit(X_train,Y_train)
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py", line 326, in fit
    for i, t in enumerate(trees))
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
    self.results = batch()
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py", line 120, in _parallel_build_trees
    tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 739, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Users\RAJIV MISHRA\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 240, in fit
    "number of samples=%d" % (len(y), n_samples))
ValueError: Number of labels=19 does not match number of samples=1
Answer 0 (score: 1)
A few things in this code can be fixed. Assuming X.shape[0] == Y.shape[0]:
1. If you split the data manually with slices, as in
X_train = X[:-5]
X_test = X[-5:]
Y_train = Y[:-10]
Y_test = Y[-10:]
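the code has another problem: the sample indices do not match the label indices, because X is cut 5 rows from the end while Y is cut 10 rows from the end. It can be fixed with the following.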
X_train = X[:-5]
X_test = X[-5:]
Y_train = Y[:-5]
Y_test = Y[-5:]
2. If you use train_test_split to split the dataset into training and test sets, you should pass both the samples and the labels.
X_train, X_test, Y_train, Y_test=train_test_split(X,Y,test_size=0.2,random_state=20)
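Both points can be checked with a small sketch. It makes several assumptions: the DataFrame here is a hypothetical 20-row stand-in for the encoded data in the question, and the import comes from sklearn.model_selection, where train_test_split lives in recent scikit-learn releases (the question uses the older sklearn.cross_validation module). It also keeps X two-dimensional with double brackets. The `samples=1` in the error suggests that a one-dimensional X may have been read as a single sample, but that is a guess about the question's setup, not something the traceback proves.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the label-encoded DataFrame from the question.
data = pd.DataFrame({
    "learner": [0, 1, 2, 3, 4] * 4,
    "Dominant_Subject": [0, 1, 0, 1, 0] * 4,
})

# Double brackets keep X two-dimensional: shape (n_samples, n_features).
X = data[["learner"]]
Y = data["Dominant_Subject"]

# Point 1: slicing X and Y at different offsets gives different lengths.
mismatched = (len(X[:-5]), len(Y[:-10]))   # 15 samples vs 10 labels
aligned = (len(X[:-5]), len(Y[:-5]))       # 15 samples vs 15 labels

# Point 2: pass both X and Y so the rows are split together.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=20
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, Y_train)

# Score on the held-out rows rather than on the full data set.
predicted = clf.predict(X_test)
score = accuracy_score(Y_test, predicted)
print(mismatched, aligned, X_train.shape, Y_train.shape, score)
```

Scoring on X_test and Y_test instead of the full X and Y is a design choice of this sketch: it measures performance on rows the model did not see during training.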