How to compare date records and add a new column to a DataFrame based on a condition

Time: 2019-10-08 21:30:40

Tags: python pandas dataframe

I have the following DataFrame of sensor data:

    Data_Digital    Data_Analog         Time
        1               10           2015-02-01 00:00:00
        1               12           2015-02-01 00:00:05
        1               25           2015-02-01 07:45:07
        1               25           2015-02-01 07:45:08
        1               25           2015-02-01 21:45:10
        0               25           2015-03-04 00:00:08

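I need to compare the "Time" at position 0 with the "Time" at position 1, and so on for each pair of consecutive rows. If the time difference between two consecutive rows is greater than six hours, they must belong to different classes. If the difference is smaller (< 6 hours), they must belong to the same class. I need to represent this class in a new DataFrame column.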

The desired output is:

      Data_Digital  Data_Analog         Time                New_Col_Target
        1               10           2015-02-01 00:00:00            1      # init with 1
        1               12           2015-02-01 00:00:05            1
        1               25           2015-02-01 07:45:07            2      # far from the previous
        1               25           2015-02-01 07:45:08            2
        1               25           2015-02-01 21:45:10            3      # far from the previous
        0               25           2015-03-04 00:00:08            4      # far from the previous

Here is the original DataFrame:

    import pandas as pd

    df = pd.DataFrame({'Data_Digital': [1, 1, 1, 1, 1, 0],
                       'Data_Analog':  [10, 12, 25, 25, 25, 25],
                       'Time': ['2015-02-01 00:00:00', '2015-02-01 00:00:05','2015-02-01 07:45:07',
                                '2015-02-01 07:45:08', '2015-02-01 21:45:10', '2015-03-04 00:00:08']})

    print(df)

Here is what I tried (but it is wrong):

    index = 0
    index2 = 1

    df['New_Col_Target'] = 1

    for i in range(0, len(df) - 1):
        for j in range(1, len(df)):

            if (abs(pd.to_datetime(df['Time'].iloc[i]) -
                    pd.to_datetime(df['Time'].iloc[j])) >
                    pd.to_timedelta('0 day 06:00:00')):

                # I don't know how to do the assignments
                df['New_Col_Target'].iloc[i] = index
            else:
                # I don't know how to do the assignments
                df['New_Col_Target'].iloc[i] = index2
                index2 += 1




    # New process

       Date Init                 Date End        Mean_Dig   Mean_Analog
      2015-02-01 00:00:00   2015-02-01 00:00:05       1         11
      2015-02-01 07:45:07   2015-02-01 07:45:08       1         25
      2015-02-01 07:45:08   2015-02-01 07:45:08       1         25
      2015-03-04 00:00:08   2015-03-04 00:00:08       0         25

    df_mean_group_New_Col_Target = pd.DataFrame({'Date Init': ['2015-02-01 00:00:00', '2015-02-01 07:45:07', '2015-02-01 07:45:08', '2015-03-04 00:00:08'],
                                         'Date End': ['2015-02-01 00:00:05', '2015-02-01 07:45:08', '2015-02-01 07:45:08', '2015-03-04 00:00:08'],
                                         'Mean_Data_Digital': [1, 1, 1, 0], 
                                         'Mean_Data_Analog': [11, 25, 25, 25]})

    print(df_mean_group_New_Col_Target)

1 Answer:

Answer 0 (score: 3)

Use Series.diff, pd.Timedelta and Series.cumsum:

df['New_col_target'] = (df['Time'].diff() > pd.Timedelta(hours=6)).cumsum().add(1)

Output:

   Data_Digital  Data_Analog                Time  New_col_target
0             1           10 2015-02-01 00:00:00               1
1             1           12 2015-02-01 00:00:05               1
2             1           25 2015-02-01 07:45:07               2
3             1           25 2015-02-01 07:45:08               2
4             1           25 2015-02-01 21:45:10               3
5             0           25 2015-03-04 00:00:08               4
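
To see why the one-liner works, it can help to split it into its three steps. The following is a minimal sketch on the sample data from the question; the intermediate names gap and is_break are illustrative only and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Data_Digital': [1, 1, 1, 1, 1, 0],
    'Data_Analog': [10, 12, 25, 25, 25, 25],
    'Time': pd.to_datetime([
        '2015-02-01 00:00:00', '2015-02-01 00:00:05', '2015-02-01 07:45:07',
        '2015-02-01 07:45:08', '2015-02-01 21:45:10', '2015-03-04 00:00:08',
    ]),
})

# Step 1: gap between each row and the previous one.
# The first row has no predecessor, so its gap is NaT.
gap = df['Time'].diff()

# Step 2: True where a row is more than six hours after the previous row.
# Comparing NaT with a Timedelta gives False, so row 0 is not a break.
is_break = gap > pd.Timedelta(hours=6)

# Step 3: a running count of breaks numbers the classes 0, 1, 2, ...;
# adding 1 makes the first class start at 1.
df['New_col_target'] = is_break.cumsum() + 1

print(df)
```

Because cumsum only increases at a break, every row between two breaks receives the same class number, which is the behaviour asked for in the question.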

If your Time column is not yet of datetime dtype, convert it first:

df['Time'] = pd.to_datetime(df['Time'])

Option 2: per Data_Digital group

Here we have to use groupby, and also start a new class whenever Data_Digital changes:

m1 = df.groupby('Data_Digital')['Time'].diff().ge(pd.Timedelta(hours=6))
m2 = df['Data_Digital'].diff().ne(0)

df['New_col_target'] = (m1|m2).cumsum()

Output:

   Data_Digital  Data_Analog                Time  New_col_target
0             1           10 2015-02-01 00:00:00               1
1             1           12 2015-02-01 00:00:05               1
2             1           25 2015-02-01 07:45:07               2
3             1           25 2015-02-01 07:45:08               2
4             1           25 2015-02-01 21:45:10               3
5             0           25 2015-03-04 00:00:08               4
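
On the sample data both options give the same result, so the difference is easy to miss. The sketch below uses a small made-up frame (the toy values are an assumption, not data from the question) in which Data_Digital flips while the timestamps stay close together.

```python
import pandas as pd

# Hypothetical data: Data_Digital changes within a few seconds.
toy = pd.DataFrame({
    'Data_Digital': [1, 0, 1, 1],
    'Time': pd.to_datetime([
        '2015-02-01 00:00:00', '2015-02-01 00:00:05',
        '2015-02-01 00:00:10', '2015-02-01 07:00:00',
    ]),
})
six_hours = pd.Timedelta(hours=6)

# Option 1: only the time gap between consecutive rows matters.
by_time_only = (toy['Time'].diff() > six_hours).cumsum().add(1)

# Option 2: time gap within each Data_Digital group, plus a break
# whenever Data_Digital differs from the previous row.
m1 = toy.groupby('Data_Digital')['Time'].diff().ge(six_hours)
m2 = toy['Data_Digital'].diff().ne(0)  # first row: NaN != 0, so True
by_time_and_digital = (m1 | m2).cumsum()

print(by_time_only.tolist())         # [1, 1, 1, 2]
print(by_time_and_digital.tolist())  # [1, 2, 3, 4]
```

Note that m2 is True on the first row, which is why option 2 needs no .add(1). Option 2 also uses ge (>=) where option 1 uses >, so a gap of exactly six hours is treated differently by the two.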

Finally, get the mean for each target

We can use groupby.mean, or equivalently groupby.agg with a dict:

df.groupby('New_col_target',as_index=False)[['Data_Digital', 'Data_Analog']].mean()

df.groupby('New_col_target',as_index=False).agg({'Data_Digital':'mean',
                                                 'Data_Analog':'mean'})

Or, if you have pandas >= 0.25.0 (check pd.__version__), we can use named aggregation:

df.groupby('New_col_target').agg(
    Digital_mean=('Data_Digital', 'mean'),
    Analog_mean=('Data_Analog', 'mean')
).reset_index()

Output:

   New_col_target  Data_Digital  Data_Analog
0               1             1           11
1               2             1           25
2               3             1           25
3               4             0           25

Output of the named aggregation:

   New_col_target  Digital_mean  Analog_mean
0               1             1           11
1               2             1           25
2               3             1           25
3               4             0           25
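
The "New process" table in the question also asks for the first and last timestamp of each class. One possible way to get them, sketched here under the assumption that pandas >= 0.25.0 is available, is to add min and max of Time to the named aggregation. The output column names Date_Init, Date_End, Mean_Data_Digital and Mean_Data_Analog are illustrative choices modelled on the question, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Data_Digital': [1, 1, 1, 1, 1, 0],
    'Data_Analog': [10, 12, 25, 25, 25, 25],
    'Time': pd.to_datetime([
        '2015-02-01 00:00:00', '2015-02-01 00:00:05', '2015-02-01 07:45:07',
        '2015-02-01 07:45:08', '2015-02-01 21:45:10', '2015-03-04 00:00:08',
    ]),
})
df['New_col_target'] = (df['Time'].diff() > pd.Timedelta(hours=6)).cumsum().add(1)

# One row per class: first/last timestamp and the mean of both data columns.
summary = df.groupby('New_col_target').agg(
    Date_Init=('Time', 'min'),
    Date_End=('Time', 'max'),
    Mean_Data_Digital=('Data_Digital', 'mean'),
    Mean_Data_Analog=('Data_Analog', 'mean'),
).reset_index()

print(summary)
```

With the sample data, the third class consists of the single row at 2015-02-01 21:45:10, so that is both its Date_Init and its Date_End; this differs from the third row of the table in the question. The mean columns come back as floats.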