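I have a dataset for machine learning and want to split it into a training set and a test set. The training set should contain all loans issued up to September; the test set should contain the rest (e.g. April to October, November, December). How can I prepare the dataset in the way best suited to this task?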
issue_d int_rate installment dti revol_bal revol_util inq_last_6mths delinq_2yrs pub_rec loan_status purpose_credit_card purpose_debt_consolidation purpose_home_improvement purpose_house purpose_major_purchase purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding
11 Mar-2018 14.07% 233.05 24.69 707 15.7% 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
16 Mar-2018 11.98% 232.44 20.25 5004 36% 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
17 Mar-2018 26.77% 607.97 24.40 7364 46% 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
20 Mar-2018 20.39% 560.94 15.76 14591 34.2% 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0
23 Mar-2018 7.34% 930.99 16.18 755 0% 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0
...
130741 Apr-2018 6.07% 309.85 14.64 17380 24.5% 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
130742 Apr-2018 11.98% 555.86 21.05 19591 20.5% 2 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
130744 Apr-2018 11.98% 215.84 14.68 4707 37.7% 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
The dtype of issue_d is object.
So far I have not taken the dates into account and have simply used the following:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)
You can download the CSVs here (bank loans from 2018, split into four quarters). They can be loaded in Python 3 as follows:
import pandas as pd
# Control delimiters, rows, column names with read_csv (see later)
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]
result = pd.concat(frames)
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])]
Answer 0: (score: 0)
You need to convert the date column (currently object) to datetime.
Then carry on with the ML.
So your pandas code will look something like
df['issue_d'] = df['issue_d'].astype('datetime64[ns]')
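If relying on automatic parsing feels fragile, an explicit format can be passed to pd.to_datetime instead. This is a minimal sketch, assuming the values look like 'Mar-2018' as in the question; the three-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample mimicking the 'issue_d' column from the question
df = pd.DataFrame({"issue_d": ["Mar-2018", "Apr-2018", "Oct-2018"]})

# %b = abbreviated month name, %Y = four-digit year; the day defaults to 1
df["issue_d"] = pd.to_datetime(df["issue_d"], format="%b-%Y")

# True for loans issued up to and including September 2018
is_train = df["issue_d"] < pd.Timestamp("2018-10-01")
```

Comparing against the first day of October keeps every September loan on the training side, whatever day the parser assigns.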
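However, if your dates are in some unusual or non-standard format, write a custom function that parses the string with strptime and returns a datetime object. You can then use this function through apply(), like this: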
df['d_object'] = df.d_object.apply(my_convert_function)
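As a rough illustration of what such a function might look like: the name my_convert_function comes from the line above, while the "%b-%Y" format string and the sample values are assumptions chosen to match the 'Mar-2018' style seen in the question.

```python
from datetime import datetime

import pandas as pd


def my_convert_function(value):
    """Parse a string such as 'Mar-2018' into a datetime (day defaults to 1)."""
    # The format string is an assumption; adjust it to the actual data
    return datetime.strptime(value, "%b-%Y")


# Hypothetical sample column
df = pd.DataFrame({"d_object": ["Mar-2018", "Nov-2018"]})
df["d_object"] = df["d_object"].apply(my_convert_function)
```

For large frames a vectorised pd.to_datetime call is usually faster than apply(), so the custom function is mainly worth it when the format cannot be described by a single format string.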
Hope that helps.
Answer 1: (score: 0)
The 'issue_d' column contains strings such as
['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018']
If you convert these to monthly periods:
In [545]: periods = pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M'); periods
Out[545]: PeriodIndex(['2018-03', '2018-02', '2018-01', '2018-06', '2018-05', '2018-04'], dtype='period[M]', freq='M')
then an expression such as periods <= '2018-09' (yes, a PeriodIndex understands comparison with strings) gives a boolean mask that selects which rows go into the training and test DataFrames.
In [558]: pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M') < '2018-04'
Out[558]: array([ True, True, True, False, False, False])
import pandas as pd
# Control delimiters, rows, column names with read_csv (see later)
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]
result = pd.concat(frames)
# .copy() so that adding a column below does not trigger SettingWithCopyWarning
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])].copy()
subset['issue_period'] = pd.PeriodIndex(subset['issue_d'].values, freq='M')
mask = (subset['issue_period'] <= '2018-09')
train = subset.loc[mask]
test = subset.loc[~mask]
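From here, the X/y pairs used in the question can be derived from train and test. This is a rough sketch on a made-up four-row frame, assuming loan_status is the (already encoded) target and that the date columns are dropped from the features; neither assumption is stated in the answer itself.

```python
import pandas as pd

# Hypothetical miniature stand-in for `subset`
subset = pd.DataFrame({
    "issue_d": ["Mar-2018", "Sep-2018", "Oct-2018", "Dec-2018"],
    "dti": [24.69, 20.25, 24.40, 15.76],
    "loan_status": [1, 1, 0, 1],
})
subset["issue_period"] = pd.PeriodIndex(subset["issue_d"].values, freq="M")

# Comparing against an explicit Period avoids relying on string parsing
mask = subset["issue_period"] <= pd.Period("2018-09", freq="M")
train, test = subset.loc[mask], subset.loc[~mask]

# Drop the target and the date columns from the feature matrices
non_features = ["issue_d", "issue_period", "loan_status"]
X_train, y_train = train.drop(columns=non_features), train["loan_status"]
X_test, y_test = test.drop(columns=non_features), test["loan_status"]
```

Unlike train_test_split with shuffle=True, this split is purely chronological, so no loan issued after September can leak into training.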