Question

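I am trying to split a dataset into training and test sets, but I am running into a problem in the function that starts at line 45. When I run the program, it fails with "KeyError: 667952" (the number in the error is different on every run).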

1 import numpy as np
2 import pandas as pd
3 import random
4
5 data_file = pd.read_csv('loan.csv')
6
7 # variable preprocessing
8 
9 data_file['loan_status'] = np.where(data_file['loan_status'].isin(['Fully Paid', 'Current']), 1, 0)
10 loan_stat=data_file['loan_status']
11
12 m = {
13   'n/a': 0,     
14   '< 1 year': 0,
15   '1 year': 1,
16   '2 years': 2,
17   '3 years': 3,
18   '4 years': 4,
19   '5 years': 5,
20   '6 years': 6,
21   '7 years': 7,
22   '8 years': 8,
23   '9 years': 9,
24   '10+ years': 10
25 }
26 emp_length=data_file.emp_length.map(m)
27
28 annual_inc=data_file['annual_inc']
29 delinq_2yrs=data_file['delinq_2yrs']
30 dti=data_file['dti']
31 loan_amnt=data_file['loan_amnt']
32 installment=data_file['installment']
33 int_rate=data_file['int_rate']
34 total_acc=data_file['total_acc']
35 open_acc=data_file['open_acc']
36 pub_rec=data_file['pub_rec']
37 acc_now_delinq=data_file['acc_now_delinq']
38
39 #variables combined into one dataset
40
41 data_set=[annual_inc, delinq_2yrs, dti, emp_length, loan_amnt, installment, 
42 int_rate, open_acc, total_acc, acc_now_delinq, loan_stat]
43 result=pd.concat(data_set,axis=1)
44
45 def splitDataSet(x, splitRatio):
46    trainSize  = int(len(x)*splitRatio)
47    trainSet=[]
48    copy=x
49    while len(trainSet)<trainSize:
50        index=random.randrange(len(copy))
51        trainSet.append(copy.pop(index))
52    return[trainSet, copy]
53
54 splitRatio=0.67
55 train, test = splitDataSet(result, splitRatio)
56 print(train)

Does anyone know how to get past this? Thanks.

Answer 1

To split the set you can write your own code, but you can also use scikit-learn's train_test_split. You can call it like this: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
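As a rough sketch of how that call could be applied to a DataFrame like the `result` built in the question: the small DataFrame below is made-up illustration data (not the contents of loan.csv), and treating `loan_status` as the target `y` is an assumption about what the asker intends.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the `result` DataFrame from the question.
result = pd.DataFrame({
    "annual_inc": [40000, 52000, 61000, 75000, 33000, 98000],
    "dti": [12.1, 8.4, 20.3, 15.0, 30.2, 5.5],
    "loan_status": [1, 1, 0, 1, 0, 1],
})

# Assumption: loan_status is the target, everything else is a feature.
X = result.drop(columns="loan_status")
y = result["loan_status"]

# Shuffle the rows and hold out roughly one third of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing random_state makes the shuffle reproducible; leave it out if you want a different split on each run.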

The relevant documentation page is at: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
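As for the KeyError itself: pandas `DataFrame.pop` removes a column by its label, not a row by its position, so `copy.pop(index)` looks for a column named after the random integer and fails with that number. Note also that `copy=x` only creates a second name for the same DataFrame rather than a copy. If you would rather keep your own function, a minimal pandas-only sketch could look like the following; the name `split_data_set`, the `seed` parameter, and the sample data are illustrative, and it assumes the DataFrame index has no duplicate values (true for the default index produced by `read_csv`).

```python
import pandas as pd

def split_data_set(df, split_ratio, seed=None):
    """Return (train, test) DataFrames; df itself is left unmodified."""
    train_size = int(len(df) * split_ratio)
    # Draw train_size random rows without replacement.
    train = df.sample(n=train_size, random_state=seed)
    # Everything not drawn becomes the test set (needs a unique index).
    test = df.drop(train.index)
    return train, test

# Made-up data standing in for the `result` DataFrame from the question.
result = pd.DataFrame({
    "loan_amnt": [5000, 12000, 7500, 20000, 3000, 15000],
    "loan_status": [1, 0, 1, 1, 0, 1],
})

train, test = split_data_set(result, 0.67, seed=0)
print(len(train), len(test))
```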
