Question

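I have a dataset, shown in the table below. I want to click a link button to make a prediction based on the "Label" field. Since I only want to predict a single row of the dataset, how can I split the data into training and test sets using this scikit-learn code?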

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=test_size)

Here is my view, to give you an idea of what I am trying to do.

def prediction_view(request):
    template = 'index.html'
    .
    .
    .
    train = Pull_Requests.objects.all()

    features_col = ['Comments', 'LC_added', 'LC_deleted', 'Commits', 'Changed_files', 'Evaluation_time', 'First_status', 'Reputation']  # This also test
    class_label = ['Label']
    X = train[features_col].dropna()  # This also test
    # y = train.Label # This also test
    y = train[class_label]

    random_state = 0
    test_size = request.POST.get('test_size')

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state, test_size=test_size)
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    classification_report = {'accuracy': Accuracy, 'pricision': Precision, 'recall': Recall, 'f1_score': F1_meseaure}
    importance_features = {'importances_feautre': importances_feautres}
    data = {
        'new_data': new_data,
        'classification_report': classification_report,
        'importance_feature': importance_features,
        'features': features_col,
    }
    return render(request, template, data)

Dataset sample:

Answer 1

For cross-validation, you can use LeaveOneOut from sklearn. For example:

from sklearn.model_selection import LeaveOneOut 

loo = LeaveOneOut()
loo.get_n_splits(X)

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

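Note that given n samples, this will give you n folds. If n is large, this can become computationally expensive (although, since you have relatively few features, n would have to be quite large).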
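As a minimal sketch of how that loop might look with pandas objects and a decision tree, assuming made-up data in place of the question's dataset (the values and the rule that generates Label are hypothetical). Because split() yields positional indices, the sketch uses .iloc rather than plain brackets:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; the real X and y come from the question's dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.integers(0, 50, size=(20, 3)),
    columns=["Comments", "Commits", "Changed_files"],
)
y = (X["Comments"] > 25).astype(int).rename("Label")

loo = LeaveOneOut()
n_folds = loo.get_n_splits(X)  # one fold per row

correct = 0
for train_index, test_index in loo.split(X):
    # split() returns positions, so select rows with .iloc on pandas objects
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)  # prediction for the single held-out row
    correct += int(y_pred[0] == y_test.iloc[0])

# Fraction of held-out rows that were predicted correctly
loo_accuracy = correct / n_folds
```

Each pass trains on n - 1 rows and predicts exactly one, which is why the cost grows with n.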

Another approach is to generate a random integer (within the range of train's index) to use as the index of the test row:

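import random
import pandas as pd

max_ind = train.index[-1]
rand_int = random.randint(0, max_ind)  # inclusive on both ends

# Hold out one row as the test set; every other row is used for training
test_idx = pd.Index([rand_int])
train_idx = train.index.difference(test_idx)

# Label-based selection; assumes X and y share train's index
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.loc[test_idx], y.loc[test_idx]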

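This assumes that the index of train is monotonically increasing. You can check this with train.index.is_monotonic_increasing and, if needed, call train.reset_index(drop=True). Alternatively, you could base the upper bound on train.shape[0] instead, in which case you should confirm that every value in the index is unique and smaller than train.shape[0], and pass train.shape[0] - 1 to random.randint, since its upper bound is inclusive.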
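The index check and the single-row hold-out can be sketched as follows, assuming a small hypothetical frame whose index has gaps (as it might after rows were dropped). The last line shows an alternative: train_test_split accepts an integer test_size, which it treats as the absolute number of test rows.

```python
import random
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with a non-contiguous index, e.g. after rows were dropped.
train = pd.DataFrame(
    {
        "Comments": [3, 8, 1, 5, 9],
        "Commits": [1, 4, 2, 2, 7],
        "Label": [0, 1, 0, 1, 1],
    },
    index=[0, 2, 5, 6, 9],
)

# Rebuild a clean 0..n-1 index so that labels and positions coincide.
train = train.reset_index(drop=True)
assert train.index.is_monotonic_increasing and train.index.is_unique

X, y = train[["Comments", "Commits"]], train["Label"]

# random.randint includes both endpoints, so the upper bound is n - 1.
random.seed(0)
test_pos = random.randint(0, train.shape[0] - 1)
X_test, y_test = X.iloc[[test_pos]], y.iloc[[test_pos]]
X_train, y_train = X.drop(index=test_pos), y.drop(index=test_pos)

# Alternative: an integer test_size requests that many test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1, random_state=0)
```

Either way the test set holds exactly one row, and the remaining rows are available for fitting the classifier.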

Is it possible to predict only one row from my dataset?
