How do I filter a DataFrame column by date?

Question

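I have a dataset for machine learning and want to split it into a training set and a test set. The training set should contain all loans issued up to September. The test set should contain the rest (e.g. October, November, December). How can I prepare the dataset in the way best suited to this task?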

    issue_d int_rate    installment dti revol_bal   revol_util  inq_last_6mths  delinq_2yrs pub_rec loan_status purpose_credit_card purpose_debt_consolidation  purpose_home_improvement    purpose_house   purpose_major_purchase  purpose_medical purpose_moving  purpose_other   purpose_renewable_energy    purpose_small_business  purpose_vacation    purpose_wedding
11  Mar-2018    14.07%  233.05  24.69   707 15.7%   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0
16  Mar-2018    11.98%  232.44  20.25   5004    36% 0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0
17  Mar-2018    26.77%  607.97  24.40   7364    46% 1   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
20  Mar-2018    20.39%  560.94  15.76   14591   34.2%   0   1   0   1   0   0   0   1   0   0   0   0   0   0   0   0
23  Mar-2018    7.34%   930.99  16.18   755 0%  0   1   0   1   0   0   0   1   0   0   0   0   0   0   0   0
...
130741  Apr-2018    6.07%   309.85  14.64   17380   24.5%   1   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
130742  Apr-2018    11.98%  555.86  21.05   19591   20.5%   2   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
130744  Apr-2018    11.98%  215.84  14.68   4707    37.7%   1   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0

The dtype of issue_d is object.

So far I have been splitting without taking the dates into account:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)

Appendix: reproducing the DataFrame

You can download the CSV files here (bank loans from 2018, split into four quarters). They can be loaded in Python 3 as follows:

import pandas as pd 
# Control delimiters, rows, column names with read_csv (see later) 
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]

result = pd.concat(frames)
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])]

Answer 1

You need to convert the dates (currently of dtype object) to datetime.

Then carry on with the ML.

So your pandas code would look something like

df['issue_d'] = df['issue_d'].astype('datetime64[ns]')
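
As a minimal sketch of the conversion followed by a date-based split, the example below uses pd.to_datetime with an explicit format instead of astype. The toy rows, the '%b-%Y' format string, and the choice to keep September in the training set are assumptions for illustration; '%b' also assumes English month abbreviations.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
df = pd.DataFrame({
    "issue_d": ["Mar-2018", "Sep-2018", "Oct-2018", "Dec-2018"],
    "loan_status": [1, 0, 1, 1],
})

# '%b-%Y' matches strings such as 'Mar-2018' (abbreviated month, 4-digit year).
df["issue_d"] = pd.to_datetime(df["issue_d"], format="%b-%Y")

# Assumption: loans issued before 1 October 2018 go to training, the rest to testing.
cutoff = pd.Timestamp("2018-10-01")
train = df[df["issue_d"] < cutoff]
test = df[df["issue_d"] >= cutoff]
```

Passing the format explicitly avoids relying on pandas guessing how to parse 'Mar-2018'.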

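However, if your dates are in some odd or non-standard format, write a custom function that uses

 strptime  (to parse the custom time format)

and returns a datetime object. You can then use this function through apply() like this: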

df['d_object'] = df.d_object.apply(my_convert_function)
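
A possible shape for such a function, as a sketch only: my_convert_function is the placeholder name from the line above, and the '%b-%Y' format and the two sample strings are assumptions chosen to match values like 'Mar-2018'.

```python
from datetime import datetime

import pandas as pd


def my_convert_function(value):
    """Parse a string such as 'Mar-2018' into a datetime (first day of that month)."""
    return datetime.strptime(value, "%b-%Y")


# Toy column with made-up values in the assumed format.
df = pd.DataFrame({"d_object": ["Mar-2018", "Apr-2018"]})

# apply() calls the function once per element of the column.
df["d_object"] = df["d_object"].apply(my_convert_function)
```

For large frames, a vectorised pd.to_datetime call with the same format string is usually faster than apply().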

Hope that helps.

Answer 2

The 'issue_d' column contains strings such as

['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018']

If you convert these to monthly periods:

In [545]: periods = pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M'); periods
Out[545]: PeriodIndex(['2018-03', '2018-02', '2018-01', '2018-06', '2018-05', '2018-04'], dtype='period[M]', freq='M')

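then an expression like periods <= '2018-09' (yes, a PeriodIndex understands comparisons against strings) creates a boolean mask that selects which rows go into the training and testing DataFrames.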

In [558]: pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M') < '2018-04'
Out[558]: array([ True,  True,  True, False, False, False])

Putting it together on the full dataset:

import pandas as pd 
# Control delimiters, rows, column names with read_csv (see later) 
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]
result = pd.concat(frames)
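# Keep only finished loans; .copy() avoids a SettingWithCopyWarning when adding a column below
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])].copy()

# Convert strings such as 'Mar-2018' to monthly periods
subset['issue_period'] = pd.PeriodIndex(subset['issue_d'].values, freq='M')
# True for loans issued up to and including September 2018
mask = (subset['issue_period'] <= '2018-09')
train = subset.loc[mask]
test = subset.loc[~mask]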
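
To try the period-based split without downloading the CSV files, here is a small self-contained sketch on a toy frame. The rows, the single int_rate feature column, and the feature_cols name are assumptions for illustration; the comparison uses an explicit pd.Period rather than a string, which behaves the same way here.

```python
import pandas as pd

# Toy data with made-up values; only the column names follow the question.
toy = pd.DataFrame({
    "issue_d": ["Mar-2018", "Sep-2018", "Oct-2018", "Dec-2018"],
    "int_rate": [14.07, 11.98, 26.77, 6.07],
    "loan_status": [1, 1, 0, 1],
})

# Monthly periods parsed from the 'Mon-YYYY' strings.
issue_period = pd.PeriodIndex(toy["issue_d"].values, freq="M")

# Boolean mask: True for loans issued up to and including September 2018.
mask = issue_period <= pd.Period("2018-09", freq="M")

train, test = toy.loc[mask], toy.loc[~mask]

# Separate features and target for each part (hypothetical feature list).
feature_cols = ["int_rate"]
X_train, y_train = train[feature_cols], train["loan_status"]
X_test, y_test = test[feature_cols], test["loan_status"]
```

Unlike train_test_split with shuffle=True, this keeps every test loan later in time than every training loan.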
