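I have a dataset, shown in the table below. I want to click a link button to make a prediction based on the "Label" field. My question is: since I only want to predict a single row of the dataset, how do I split the data into training and test sets using this scikit-learn code?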
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=test_size)
Here is my view, to give you an idea of what I am trying to do.
def prediction_view(request):
    template='index.html'
    .
    .
    .
    train=Pull_Requests.objects.all()
    features_col = ['Comments', 'LC_added', 'LC_deleted', 'Commits', 'Changed_files', 'Evaluation_time','First_status','Reputation'] # This also test
    class_label=['Label']
    X = train[features_col].dropna() # This also test
    # y = train.Label # This also test
    y=train[class_label]
    random_state = 0
    test_size=request.POST.get('test_size')
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=test_size)
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    classification_report={'accuracy':Accuracy, 'pricision':Precision, 'recall':Recall, 'f1_score':F1_meseaure}
    importance_features={'importances_feautre':importances_feautres}
    data={
        'new_data':new_data,
        'classification_report':classification_report,
        'importance_feature':importance_features,
        'features':features_col,
    }
    return render(request,template,data)
Answer 0 (score: 1)
For cross-validation, you can use LeaveOneOut from sklearn. For example:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
loo.get_n_splits(X)
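for train_index, test_index in loo.split(X):
    # split() yields positional indices; X and y are pandas objects here, hence .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]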
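Note that, given n samples, this gives you n folds. If n is large, this can become computationally expensive (although with so few features, n can probably grow quite large before that becomes a problem).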
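To make the cost of n folds concrete, here is a minimal sketch that trains one decision tree per fold and averages the single-sample results. The data is synthetic and purely illustrative (20 random samples with 3 features, stored as NumPy arrays so plain positional indexing works); it is not the asker's dataset.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration): 20 samples, 3 features.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = (X[:, 0] > 0.5).astype(int)

loo = LeaveOneOut()
n_correct = 0
for train_index, test_index in loo.split(X):
    # One model is fitted per fold, which is where the cost comes from.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_index], y[train_index])
    # test_index holds exactly one position, so predict returns one label.
    n_correct += int(clf.predict(X[test_index])[0] == y[test_index][0])

n_splits = loo.get_n_splits(X)
loo_accuracy = n_correct / n_splits
print(f"{n_splits} folds, leave-one-out accuracy: {loo_accuracy:.2f}")
```

With 20 samples the loop fits 20 trees; the same loop over a table with many thousands of rows would fit that many models.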
Another approach is to generate a random integer (within the range of train's index) and use it as the index of the single test row:
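import random
import pandas as pd

max_ind = train.index[-1]
rand_int = random.randint(0, max_ind)  # randint includes both endpoints
test_idx = pd.Index([rand_int])
# Every index label except the test one (~test_idx would bitwise-negate the labels instead)
train_idx = train.index.drop(test_idx)
# Label-based selection; assumes X and y still share train's index
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.loc[test_idx], y.loc[test_idx]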
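This assumes that the index of train is monotonically increasing. You can check that with train.index.is_monotonic_increasing and, if necessary, call train.reset_index(drop=True). Alternatively, you could derive the upper bound from train.shape[0], in which case you should confirm that every value in the index is unique and stays within that bound. Keep in mind that random.randint includes both endpoints, so for a default 0..n-1 index the upper bound should be train.shape[0] - 1.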
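As a possible shortcut for holding out a single random row, train_test_split also accepts an integer test_size, which it treats as an absolute number of test samples. The sketch below assumes a small made-up DataFrame that borrows two column names from the question; the values are invented. Note that values read from request.POST arrive as strings, so a posted test_size would need converting (for example with int()) before being passed in.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the real table; the values are made up.
df = pd.DataFrame({
    "Comments": [1, 4, 2, 7, 3, 5],
    "Commits": [2, 1, 3, 5, 2, 4],
    "Label": [0, 1, 0, 1, 0, 1],
})
X = df[["Comments", "Commits"]]
y = df["Label"]

# An integer test_size is an absolute count, so 1 holds out exactly one row.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)  # one prediction for the single held-out row
print("held-out row label:", X_test.index[0], "prediction:", y_pred[0])
```

This keeps the original train_test_split call shape and avoids manual index bookkeeping, at the price of not choosing which row is held out.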