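I'm trying to split a dataset into training and test sets, but I'm running into a problem in the function that starts at line 45. When I run the program, it raises "KeyError: 667952" (the number in the error is different on every run).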
1 import numpy as np
2 import pandas as pd
3 import random
4
5 data_file = pd.read_csv('loan.csv')
6
7 # variable preprocessing
8
9 data_file['loan_status'] = np.where(data_file['loan_status'].isin(['Fully Paid', 'Current']), 1, 0)
10 loan_stat=data_file['loan_status']
11
12 m = {
13     'n/a': 0,
14     '< 1 year': 0,
15     '1 year': 1,
16     '2 years': 2,
17     '3 years': 3,
18     '4 years': 4,
19     '5 years': 5,
20     '6 years': 6,
21     '7 years': 7,
22     '8 years': 8,
23     '9 years': 9,
24     '10+ years': 10
25 }
26 emp_length=data_file.emp_length.map(m)
27
28 annual_inc=data_file['annual_inc']
29 delinq_2yrs=data_file['delinq_2yrs']
30 dti=data_file['dti']
31 loan_amnt=data_file['loan_amnt']
32 installment=data_file['installment']
33 int_rate=data_file['int_rate']
34 total_acc=data_file['total_acc']
35 open_acc=data_file['open_acc']
36 pub_rec=data_file['pub_rec']
37 acc_now_delinq=data_file['acc_now_delinq']
38
39 #variables combined into one dataset
40
41 data_set=[annual_inc, delinq_2yrs, dti, emp_length, loan_amnt, installment,
42           int_rate, open_acc, total_acc, acc_now_delinq, loan_stat]
43 result=pd.concat(data_set,axis=1)
44
45 def splitDataSet(x, splitRatio):
46     trainSize = int(len(x)*splitRatio)
47     trainSet=[]
48     copy=x
49     while len(trainSet)<trainSize:
50         index=random.randrange(len(copy))
51         trainSet.append(copy.pop(index))
52     return[trainSet, copy]
53
54 splitRatio=0.67
55 train, test = splitDataSet(result, splitRatio)
56 print(train)
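Does anyone know how to get past this? Thanks.
Answer 0 (score: 1)
To split the dataset you can write your own code, but you can also use scikit-learn. With train_test_split (imported from sklearn.model_selection) you can write X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)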
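If you would rather keep your own function, here is a minimal sketch of that route. The KeyError most likely comes from copy.pop(index) on line 51: DataFrame.pop takes a column label, not a row position, so the random integer is looked up as a column name and is not found. Note also that copy=x does not copy anything; it only gives the same DataFrame a second name. The version below instead samples rows with DataFrame.sample and uses the remaining rows as the test set. It assumes the frame has a unique index (true for the default index that read_csv produces). The name split_data_set, the seed parameter, and the small demo frame are illustrative and not part of the original post.

```python
import pandas as pd


def split_data_set(df, split_ratio, seed=None):
    """Randomly split a DataFrame by rows into (train, test)."""
    # sample() draws rows without replacement; frac is the share kept for training.
    train = df.sample(frac=split_ratio, random_state=seed)
    # Rows whose index labels were not sampled form the test set.
    # This relies on the index being unique (an assumption of this sketch).
    test = df.drop(train.index)
    return train, test


# Tiny stand-in for the `result` frame from the question (values are made up).
demo = pd.DataFrame({
    "loan_amnt": range(100),
    "loan_status": [i % 2 for i in range(100)],
})

train, test = split_data_set(demo, 0.67, seed=42)
print(len(train), len(test))
```

In the original script the call would then be train, test = split_data_set(result, splitRatio). Both parts stay DataFrames, so the column names are preserved.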