How to get rows by max date, then add columns with all dates and the min date

Asked: 2018-06-06 16:50:20

Tags: python pandas

I am basically trying to do what this question does: How to get rows by max date with certain columns?

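However, I also want three new columns:

  1. One called dates (containing all dates for a given combination of I and II, preferably sorted).
  2. One called min_date (containing the minimum date for a given combination of I and II).
  3. One called days_diff (containing the difference in days between the max and min dates).

Following the example from the original question: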

       I II III        IV        dates                 min_date   days_diff
    0  A  X 2017-01-30 some_data 2017-01-30|2017-01-27 2017-01-27 2
    1  A  Y 2017-01-30 some_data 2017-01-30|2017-01-27 2017-01-27 2
    2  A  Z 2017-01-30 some_data 2017-01-30|2017-01-27 2017-01-27 2
    6  B  X 2017-01-30 some_data 2017-01-30|2017-01-27 2017-01-27 2
    7  B  Y 2017-01-30 some_data 2017-01-30|2017-01-27 2017-01-27 2
    8  B  Z 2017-01-30 some_data 2017-01-30|2017-01-27 2017-01-27 2
    

I can do this in a for loop, looking up all the rows for each unique combination of I and II:

    data = [
        ('I', 'II', 'III', 'IV'),
        ('A', 'X', '1/30/2017 9:33:00 AM', 'some_data'),
        ('A', 'Y', '1/30/2017 9:33:00 AM', 'some_data'),
        ('A', 'Z', '1/30/2017 9:33:00 AM', 'some_data'),
        ('A', 'X', '1/27/2017 4:53:00 PM', 'some_data'),
        ('A', 'Y', '1/27/2017 4:53:00 PM', 'some_data'),
        ('A', 'Z', '1/27/2017 4:53:00 PM', 'some_data'),
        ('B', 'X', '1/30/2017 9:33:00 AM', 'some_data'),
        ('B', 'Y', '1/30/2017 9:33:00 AM', 'some_data'),
        ('B', 'Z', '1/30/2017 9:33:00 AM', 'some_data'),
        ('B', 'X', '1/27/2017 4:53:00 PM', 'some_data'),
        ('B', 'Y', '1/27/2017 4:53:00 PM', 'some_data'),
        ('B', 'Z', '1/27/2017 4:53:00 PM', 'some_data'),
    ]
    
    import pandas as pd
    df = pd.DataFrame(data[1:], columns=data[0])
    df['III'] = pd.to_datetime(df['III'])
    
    # groupby first two columns, then get the maximum value in the third column
    idx = df.groupby(['I', 'II'])['III'].transform('max') == df['III']
    
    # use the boolean mask to fetch the correct rows; copy so that adding
    # columns below does not trigger SettingWithCopyWarning
    df_dedup = df[idx].copy()
    df_dedup['dates'] = None
    df_dedup['min_date'] = None
    df_dedup['days_diff'] = None
    
    
    # now iterate across all rows of df_dedup and find min and all dates
    for row_idx, row in df_dedup.iterrows():
        target_idx = (df['I'] == row['I']) & (df['II'] == row['II'])
        dates = '|'.join(df[target_idx]['III'].astype('str'))
        min_date = min(df[target_idx]['III'])
        days_diff = row['III'] - min_date
        # write to this row only, not to the whole column
        df_dedup.at[row_idx, 'dates'] = dates
        df_dedup.at[row_idx, 'min_date'] = min_date
        df_dedup.at[row_idx, 'days_diff'] = days_diff
    

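However, this is very slow for a large df. I am looking for help vectorizing this with pandas so that it runs much faster. Any ideas would be appreciated.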

The output for this particular example is:

    print(df_dedup)
       I II                 III         IV  \
    0  A  X 2017-01-30 09:33:00  some_data   
    1  A  Y 2017-01-30 09:33:00  some_data   
    2  A  Z 2017-01-30 09:33:00  some_data   
    6  B  X 2017-01-30 09:33:00  some_data   
    7  B  Y 2017-01-30 09:33:00  some_data   
    8  B  Z 2017-01-30 09:33:00  some_data   
                                         dates            min_date       days_diff  
    0  2017-01-30 09:33:00|2017-01-27 16:53:00 2017-01-27 16:53:00 2 days 16:40:00  
    1  2017-01-30 09:33:00|2017-01-27 16:53:00 2017-01-27 16:53:00 2 days 16:40:00  
    2  2017-01-30 09:33:00|2017-01-27 16:53:00 2017-01-27 16:53:00 2 days 16:40:00  
    6  2017-01-30 09:33:00|2017-01-27 16:53:00 2017-01-27 16:53:00 2 days 16:40:00  
    7  2017-01-30 09:33:00|2017-01-27 16:53:00 2017-01-27 16:53:00 2 days 16:40:00  
    8  2017-01-30 09:33:00|2017-01-27 16:53:00 2017-01-27 16:53:00 2 days 16:40:00  
    

1 Answer:

Answer 0: (score: 1)

Just do what you did in your earlier post; this time we also need the groupby min:

    # min and max of III within each (I, II) group, aligned to the rows of df
    g = df.groupby(['I', 'II'])['III']
    s1, s2 = g.transform('min'), g.transform('max')
    df['min_date'] = s1
    df['dates'] = s1.dt.date.astype(str) + '|' + s2.dt.date.astype(str)
    df['days_diff'] = s2 - s1
    # keep only the rows that hold each group's max date
    print(df.loc[df['III'] == s2, :])
       I II                 III         IV            min_date  \
    0  A  X 2017-01-30 09:33:00  some_data 2017-01-27 16:53:00   
    1  A  Y 2017-01-30 09:33:00  some_data 2017-01-27 16:53:00   
    2  A  Z 2017-01-30 09:33:00  some_data 2017-01-27 16:53:00   
    6  B  X 2017-01-30 09:33:00  some_data 2017-01-27 16:53:00   
    7  B  Y 2017-01-30 09:33:00  some_data 2017-01-27 16:53:00   
    8  B  Z 2017-01-30 09:33:00  some_data 2017-01-27 16:53:00   
                       dates       days_diff  
    0  2017-01-27|2017-01-30 2 days 16:40:00  
    1  2017-01-27|2017-01-30 2 days 16:40:00  
    2  2017-01-27|2017-01-30 2 days 16:40:00  
    6  2017-01-27|2017-01-30 2 days 16:40:00  
    7  2017-01-27|2017-01-30 2 days 16:40:00  
    8  2017-01-27|2017-01-30 2 days 16:40:00