Pandas: creating a new categorical column from timestamps

Time: 2018-07-17 11:19:59

Tags: python pandas classification

I am trying to create a new categorical column 'Stages_So' and add it to my original DataFrame.

Event_Code Timestamp
2053    13/08/2016 11:30
1029    10/09/2016 14:00
2053    02/10/2016 13:15
2053    06/11/2016 16:30
2053    19/11/2016 15:00
2053    03/12/2016 17:30
1029    02/01/2017 15:00
1029    05/02/2017 16:00
2053    11/02/2017 15:00
1029    04/03/2017 15:00
2053    01/04/2017 14:00
1029    21/05/2017 14:00

I tried the following function.

def label_stage(row):
    if row['Timestamp'] > '2016-08-12' and row['Timestamp'] < '2016-11-07':
        return 0
    if row['Timestamp'] > '2016-11-18' and row['Timestamp'] < '2017-02-06':
        return 1
    if row['Timestamp'] > '2017-02-10' and row['Timestamp'] < '2017-05-22':
        return 2


df['Stages_So'] = df.apply(lambda row: label_stage(row), axis=1)

But it raises an error: TypeError: ("Cannot compare type 'Timestamp' with type 'str'", 'occurred at index 957')

1 answer:

Answer 0: (score: 1)

You first need to convert the column to datetimes with to_datetime, and then compare datetimes with datetimes:

import pandas as pd

# Convert the string column to datetime64 so it can be compared with Timestamps
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

def label_stage(row):
    # Parentheses allow each condition to continue onto the next line
    if (row['Timestamp'] > pd.Timestamp('2016-08-12') and
            row['Timestamp'] < pd.Timestamp('2016-11-07')):
        return 0
    if (row['Timestamp'] > pd.Timestamp('2016-11-18') and
            row['Timestamp'] < pd.Timestamp('2017-02-06')):
        return 1
    if (row['Timestamp'] > pd.Timestamp('2017-02-10') and
            row['Timestamp'] < pd.Timestamp('2017-05-22')):
        return 2
    # Rows outside all three ranges fall through and return None (shown as NaN)

df['Stages_So'] = df.apply(label_stage, axis=1)
print(df)
    Event_Code           Timestamp  Stages_So
0         2053 2016-08-13 11:30:00        0.0
1         1029 2016-10-09 14:00:00        0.0
2         2053 2016-02-10 13:15:00        NaN
3         2053 2016-06-11 16:30:00        NaN
4         2053 2016-11-19 15:00:00        1.0
5         2053 2016-03-12 17:30:00        NaN
6         1029 2017-02-01 15:00:00        1.0
7         1029 2017-05-02 16:00:00        2.0
8         2053 2017-11-02 15:00:00        NaN
9         1029 2017-04-03 15:00:00        2.0
10        2053 2017-01-04 14:00:00        1.0
11        1029 2017-05-21 14:00:00        2.0
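
One thing to check: the timestamps in the question look day-first (13/08/2016 can only be 13 August), while the output above shows rows such as 02/10/2016 parsed as 2016-02-10, which is why several rows end up as NaN. A minimal sketch, assuming the whole column is in day/month/year order, passes an explicit format to to_datetime so the parsing does not depend on inference (`raw` and `parsed` are illustrative names only):

```python
import pandas as pd

# Three sample values copied from the question; assumed to be day/month/year
raw = pd.Series(["13/08/2016 11:30", "02/10/2016 13:15", "06/11/2016 16:30"])

# An explicit format avoids any day/month guessing by the parser
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")
print(parsed)
```

With this format, 02/10/2016 is read as 2 October 2016 rather than 10 February 2016. If the data really is month-first, the format string would need to be adjusted accordingly.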

Another, faster solution:

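import numpy as np
import pandas as pd

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# One boolean mask per stage; the date strings are converted for the comparison
m1 = (df['Timestamp'] > '2016-08-12') & (df['Timestamp'] < '2016-11-07')
m2 = (df['Timestamp'] > '2016-11-18') & (df['Timestamp'] < '2017-02-06')
m3 = (df['Timestamp'] > '2017-02-10') & (df['Timestamp'] < '2017-05-22')

# Take the label of the first matching mask, NaN where none matches
df['Stages_So'] = np.select([m1, m2, m3], [0, 1, 2], default=np.nan)
print(df)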
    Event_Code           Timestamp  Stages_So
0         2053 2016-08-13 11:30:00        0.0
1         1029 2016-10-09 14:00:00        0.0
2         2053 2016-02-10 13:15:00        NaN
3         2053 2016-06-11 16:30:00        NaN
4         2053 2016-11-19 15:00:00        1.0
5         2053 2016-03-12 17:30:00        NaN
6         1029 2017-02-01 15:00:00        1.0
7         1029 2017-05-02 16:00:00        2.0
8         2053 2017-11-02 15:00:00        NaN
9         1029 2017-04-03 15:00:00        2.0
10        2053 2017-01-04 14:00:00        1.0
11        1029 2017-05-21 14:00:00        2.0
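
For reference, here is a self-contained sketch of the vectorized approach on the sample data from the question. It rests on two assumptions that are not confirmed in the thread: that the timestamps are day-first, and that the result should be stored with pandas' category dtype. The names `bounds`, `masks` and `stages` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Event_Code": [2053, 1029, 2053, 2053, 2053, 2053,
                   1029, 1029, 2053, 1029, 2053, 1029],
    "Timestamp": [
        "13/08/2016 11:30", "10/09/2016 14:00", "02/10/2016 13:15",
        "06/11/2016 16:30", "19/11/2016 15:00", "03/12/2016 17:30",
        "02/01/2017 15:00", "05/02/2017 16:00", "11/02/2017 15:00",
        "04/03/2017 15:00", "01/04/2017 14:00", "21/05/2017 14:00",
    ],
})

# Assumption: values are day/month/year, so state the format explicitly
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%d/%m/%Y %H:%M")

# Exclusive (lower, upper) bounds for stages 0, 1 and 2, as in the question
bounds = [
    ("2016-08-12", "2016-11-07"),
    ("2016-11-18", "2017-02-06"),
    ("2017-02-10", "2017-05-22"),
]
masks = [
    (df["Timestamp"] > pd.Timestamp(lo)) & (df["Timestamp"] < pd.Timestamp(hi))
    for lo, hi in bounds
]

# Label of the first matching mask, NaN where no range matches
stages = np.select(masks, [0, 1, 2], default=np.nan)

# Store the labels as a categorical column
df["Stages_So"] = pd.Series(stages, index=df.index).astype("category")
print(df)
```

Under the day-first assumption every sample row falls inside one of the three ranges, so no NaN remains. Because np.select uses NaN as the default, the labels are floats (0.0, 1.0, 2.0) before the conversion to category.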