Question

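Consider a DataFrame with sparse temporal data. The timestamps can be very old (e.g. several years ago) or very recent.

For example, take the following DataFrame: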

                    tstamp     item_id   budget
2016-07-01 14:56:51.882649  0Szr8SuNbY  5000.00
2016-07-20 14:57:23.856878  0Szr8SuNbY  5700.00
2016-07-17 16:32:27.838435  0Lzu1xOM87   303.51
2016-07-30 21:50:03.655102  0Lzu1xOM87    94.79
2016-08-01 14:56:31.081140  0HzuoujTsN   100.00

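Suppose we need to resample this DataFrame for each item_id so that we get a dense DataFrame with one data point for every day of a predefined date range, using forward fill.

In other words, if I resample the above over the time interval

pd.date_range(date(2016,7,15), date(2016,7,31))

I should get: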

      date     item_id   budget
2016-07-15  0Szr8SuNbY  5000.00
2016-07-16  0Szr8SuNbY  5000.00
2016-07-17  0Szr8SuNbY  5000.00
...
2016-07-31  0Szr8SuNbY  5000.00
2016-07-15  0Lzu1xOM87      NaN
2016-07-16  0Lzu1xOM87      NaN
2016-07-17  0Lzu1xOM87   303.51
...
2016-07-31  0Lzu1xOM87    94.79
2016-07-15  0HzuoujTsN      NaN
2016-07-16  0HzuoujTsN      NaN
2016-07-17  0HzuoujTsN      NaN
...
2016-07-31  0HzuoujTsN      NaN

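Note that the original DataFrame contains sparse timestamps and a potentially very large number of unique item_id values. In other words, I am looking for a computationally efficient way to resample this data at a daily frequency over a predetermined time period.

What is the best we can do in Pandas, numpy, or Python?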

Answer 1

You can perform a groupby on 'item_id' and call reindex on each group:

import pandas as pd

# Define the new time interval.
new_dates = pd.date_range('2016-07-15', '2016-07-31', name='date')

# Set the current time stamp as the index and sort it, since reindex with
# method='ffill' requires a monotonic index. Then perform the groupby.
df = df.set_index('tstamp').sort_index()
df = df.groupby('item_id').apply(lambda grp: grp['budget'].reindex(new_dates, method='ffill').to_frame())

# Reset the index to remove 'item_id' and 'date' from the index.
df = df.reset_index()
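To make the behaviour of the per-group reindex concrete, here is a minimal, self-contained sketch run on the sample data from the question. The helper name `reindex_per_item` is hypothetical, and it uses a dict of per-item Series plus `pd.concat` rather than `groupby().apply`; it assumes timestamps are unique within each item. Note that `date_range` produces midnight timestamps, so an observation made at 16:32 on 2016-07-17 only becomes visible from 2016-07-18 onward.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "tstamp": pd.to_datetime([
        "2016-07-01 14:56:51.882649",
        "2016-07-20 14:57:23.856878",
        "2016-07-17 16:32:27.838435",
        "2016-07-30 21:50:03.655102",
        "2016-08-01 14:56:31.081140",
    ]),
    "item_id": ["0Szr8SuNbY", "0Szr8SuNbY", "0Lzu1xOM87",
                "0Lzu1xOM87", "0HzuoujTsN"],
    "budget": [5000.00, 5700.00, 303.51, 94.79, 100.00],
})

new_dates = pd.date_range("2016-07-15", "2016-07-31", name="date")


def reindex_per_item(frame, dates):
    """Hypothetical helper: forward-fill each item's budget onto `dates`.

    Assumes timestamps are unique within each item_id.
    """
    pieces = {
        item_id: (
            grp.set_index("tstamp")["budget"]
            .sort_index()                      # ffill needs a monotonic index
            .reindex(dates, method="ffill")    # last observation at or before each date
        )
        for item_id, grp in frame.groupby("item_id", sort=False)
    }
    # The dict keys become the outer index level; name both levels explicitly.
    return pd.concat(pieces, names=["item_id", "date"]).reset_index()


dense = reindex_per_item(df, new_dates)
print(dense.head())
```

Items whose first observation falls after the end of the range (here 0HzuoujTsN) come out as NaN for every day, which matches the expected output in the question.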

Another option, starting again from the original df, is to pivot, reindex, and unstack:

# Define the new time interval.
new_dates = pd.date_range('2016-07-15', '2016-07-31', name='date')

# Pivot to have 'item_id' columns with 'budget' values.
df = df.pivot(index='tstamp', columns='item_id', values='budget').ffill()

# Reindex with the new dates.
df = df.reindex(new_dates, method='ffill')

# Unstack and reset the index to return to the original format.
df = df.unstack().reset_index().rename(columns={0:'budget'})
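The expected output in the question shows 303.51 already on 2016-07-17, the calendar day of the observation. One way to get that behaviour, sketched here as an assumption about the intent rather than as part of the answer above, is to floor each timestamp to its day before pivoting. The function name `resample_daily` is hypothetical, and when an item has several observations on one day it keeps the latest.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "tstamp": pd.to_datetime([
        "2016-07-01 14:56:51.882649",
        "2016-07-20 14:57:23.856878",
        "2016-07-17 16:32:27.838435",
        "2016-07-30 21:50:03.655102",
        "2016-08-01 14:56:31.081140",
    ]),
    "item_id": ["0Szr8SuNbY", "0Szr8SuNbY", "0Lzu1xOM87",
                "0Lzu1xOM87", "0HzuoujTsN"],
    "budget": [5000.00, 5700.00, 303.51, 94.79, 100.00],
})


def resample_daily(frame, start, end):
    """Hypothetical helper: daily forward-filled budget per item_id.

    Timestamps are floored to midnight, so an observation counts for the
    calendar day on which it was made.
    """
    new_dates = pd.date_range(start, end, name="date")
    daily = (
        frame.sort_values("tstamp")
        .assign(date=lambda d: d["tstamp"].dt.normalize())
        # Keep only the latest observation per item and day.
        .drop_duplicates(subset=["item_id", "date"], keep="last")
    )
    # One column per item; forward-fill gaps left by other items' dates.
    wide = daily.pivot(index="date", columns="item_id", values="budget").ffill()
    # Align to the requested range, carrying the last known value forward.
    wide = wide.reindex(new_dates, method="ffill")
    # Back to long format: one row per (item_id, date).
    return wide.unstack().rename("budget").reset_index()


dense_daily = resample_daily(df, "2016-07-15", "2016-07-31")
print(dense_daily.head())
```

Because the pivot produces one column per item, memory grows with the number of unique item_id values times the number of distinct observation days, which may matter for the very large number of items mentioned in the question.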

Efficient re-sampling of data with sparse timestamps over a predefined date range
