Question

I have a dataframe like this:

import pandas as pd

data = {'Timestamp': ['2018-07-16 14:31:03', '2018-07-13 11:59:50', '2018-07-13 11:41:07', '2018-07-13 10:50:24', '2018-07-12 15:33:59', '2018-07-12 11:32:52', '2018-07-04 13:10:30', '2018-07-04 10:37:15'],
        'Maturity': [2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021],
        'Country': ['DE', 'DE', 'ES', 'ES', 'DE', 'DE', 'ES', 'ES'],
        'Price': [50.15, 51, 66, 68, 55, 54, 72.7, 73]
        }
df = pd.DataFrame(data)
# Use the timestamps as a DatetimeIndex, then drop the now redundant column
df.index = pd.DatetimeIndex(df.Timestamp)
df.drop(columns=['Timestamp'], inplace=True)
print(df)

This produces the following df:

Timestamp         Country  Maturity  Price
16.07.2018 14:31  DE       2019      50.15
13.07.2018 11:59  DE       2019      51.00
13.07.2018 11:41  ES       2020      66.00
13.07.2018 10:50  ES       2020      68.00
12.07.2018 15:33  DE       2020      55.00
12.07.2018 11:32  DE       2020      54.00
04.07.2018 13:10  ES       2021      72.70
04.07.2018 10:37  ES       2021      73.00

I would like to resample or group the dataframe to get the last "Price" of each day for each combination of "Country" and "Maturity".

The result should look like this:

Timestamp   Country  Maturity  Price
16.07.2018  DE       2019      50.15
13.07.2018  DE       2019      51.00
13.07.2018  ES       2020      66.00
12.07.2018  DE       2020      55.00
04.07.2018  ES       2021      72.70

I tried df = df.resample('D', on='Timestamp')['Price'].agg(['last']), but unfortunately this raises an error.

Can anyone help with this?

Answer 1

I think you need groupby with Grouper and GroupBy.last:

df = df.groupby(['Maturity','Country', pd.Grouper(freq='D')])['Price'].last().reset_index()
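For anyone who wants to run this end to end, here is a minimal sketch on the question's data. The explicit sort_index() call is my own addition, not part of the answer: GroupBy.last takes the last row of each group in row order, so sorting ascending first should make that the latest timestamp of the day however the input happens to be ordered. The names df_sorted and last_per_day are only illustrative.

```python
import pandas as pd

data = {
    'Timestamp': ['2018-07-16 14:31:03', '2018-07-13 11:59:50',
                  '2018-07-13 11:41:07', '2018-07-13 10:50:24',
                  '2018-07-12 15:33:59', '2018-07-12 11:32:52',
                  '2018-07-04 13:10:30', '2018-07-04 10:37:15'],
    'Maturity': [2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021],
    'Country': ['DE', 'DE', 'ES', 'ES', 'DE', 'DE', 'ES', 'ES'],
    'Price': [50.15, 51, 66, 68, 55, 54, 72.7, 73],
}
df = pd.DataFrame(data)
df.index = pd.DatetimeIndex(df['Timestamp'])
df = df.drop(columns=['Timestamp'])

# Assumption (not in the answer): sort ascending so that "last" within
# each group is the chronologically latest row.
df_sorted = df.sort_index()

# Group by Maturity, Country and calendar day of the DatetimeIndex,
# then take the last Price in each group.
last_per_day = (
    df_sorted
    .groupby(['Maturity', 'Country', pd.Grouper(freq='D')])['Price']
    .last()
    .reset_index()
)
print(last_per_day)
```

The original df is left untouched here, so the alternative approaches can still be tried on it afterwards.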

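Or use DataFrameGroupBy.resample. Within each group, resample creates a row for every calendar day between the first and last timestamp, so days without a price come back as NaN and must be removed with dropna: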

df = df.groupby(['Maturity','Country']).resample('D')['Price'].last().dropna().reset_index()
print (df)

   Maturity Country  Timestamp  Price
0      2019      DE 2018-07-13  51.00
1      2019      DE 2018-07-16  50.15
2      2020      DE 2018-07-12  55.00
3      2020      ES 2018-07-13  66.00
4      2021      ES 2018-07-04  72.70
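To see why dropna is needed with resample, here is a small sketch using only the two (2019, DE) rows from the question, which fall on 13 and 16 July. The names sub and daily are illustrative, and the sort_index() call is my own precaution rather than part of the answer.

```python
import pandas as pd

# Only the (2019, DE) quotes from the question
sub = pd.DataFrame(
    {'Price': [50.15, 51.0]},
    index=pd.DatetimeIndex(
        ['2018-07-16 14:31:03', '2018-07-13 11:59:50'], name='Timestamp'
    ),
)

# Daily resampling spans every calendar day from 13 to 16 July
daily = sub.sort_index().resample('D')['Price'].last()
print(daily)           # 14 and 15 July appear with NaN
print(daily.dropna())  # only the days that actually had a quote
```

The Grouper variant does not need this step on this data, which is one reason it may be the simpler choice here.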

Answer 2

Another way to solve this without resampling:

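Use drop_duplicates with the date, Country and Maturity as keys. By default it keeps the first record for each key; because the data is ordered newest first, that record is the last price of each day.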

Code:

import pandas as pd

data = {'Timestamp': ['2018-07-16 14:31:03', '2018-07-13 11:59:50', '2018-07-13 11:41:07', '2018-07-13 10:50:24', '2018-07-12 15:33:59', '2018-07-12 11:32:52', '2018-07-04 13:10:30', '2018-07-04 10:37:15'],
        'Maturity': [2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021],
        'Country': ['DE', 'DE', 'ES', 'ES', 'DE', 'DE', 'ES', 'ES'],
        'Price': [50.15, 51, 66, 68, 55, 54, 72.7, 73]
        }
df = pd.DataFrame(data)
df.index = pd.DatetimeIndex(df.Timestamp)
# Calendar day of each timestamp, used as part of the duplicate key
df['date'] = df.index.date
# keep='first' is the default: the first row per (date, Country, Maturity) survives
df = df.drop_duplicates(subset=['date', 'Country', 'Maturity'])
df.drop(['Timestamp', 'date'], axis=1, inplace=True)
print(df)
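This relies on the rows already being ordered newest first. A hedged variant that does not depend on the input order is sketched below: it sorts by Timestamp in descending order before dropping duplicates. The explicit sort and the name result are my additions, not part of the answer.

```python
import pandas as pd

data = {
    'Timestamp': ['2018-07-16 14:31:03', '2018-07-13 11:59:50',
                  '2018-07-13 11:41:07', '2018-07-13 10:50:24',
                  '2018-07-12 15:33:59', '2018-07-12 11:32:52',
                  '2018-07-04 13:10:30', '2018-07-04 10:37:15'],
    'Maturity': [2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021],
    'Country': ['DE', 'DE', 'ES', 'ES', 'DE', 'DE', 'ES', 'ES'],
    'Price': [50.15, 51, 66, 68, 55, 54, 72.7, 73],
}
df = pd.DataFrame(data)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Assumption: sort newest first so that keep='first' retains the
# latest quote of each day, whatever order the rows arrived in.
df = df.sort_values('Timestamp', ascending=False)
df['date'] = df['Timestamp'].dt.date

result = (
    df.drop_duplicates(subset=['date', 'Country', 'Maturity'], keep='first')
      .drop(columns=['date'])
      .set_index('Timestamp')
)
print(result)
```

Unlike the groupby approaches, this keeps the full original timestamp of each retained quote rather than just the day.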

Get the last value of the day in a datetime-indexed dataframe, filtered by the values of other columns