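I would like to sum the values in one column of a dataframe over certain date ranges defined by another dataframe.
My first dataframe, which holds the dates, looks like this: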
import numpy as np
import pandas as pd
start_date = ["2-22-16 00:00:00", "2-29-16 00:00:00", "3-7-16 00:00:00", "3-14-16 00:00:00", "3-21-16 00:00:00", "3-28-16 00:00:00", "4-4-16 00:00:00", "4-11-16 00:00:00", "4-18-16 00:00:00", "4-25-16 00:00:00", "5-2-16 00:00:00", "5-9-16 00:00:00", "5-16-16 00:00:00", "5-23-16 00:00:00", "5-30-16 00:00:00", "6-6-16 00:00:00", "6-13-16 00:00:00", "6-20-16 00:00:00", "6-27-16 00:00:00", "7-4-16 00:00:00", "7-11-16 00:00:00", "7-18-16 00:00:00", "7-25-16 00:00:00", "8-08-16 00:00:00", "8-22-16 00:00:00", "8-29-16 00:00:00", "9-5-16 00:00:00", "9-12-16 00:00:00", "9-19-16 00:00:00", "9-26-16 00:00:00", "10-3-16 00:00:00", "10-10-16 00:00:00", "10-17-16 00:00:00", "10-24-16 00:00:00", "10-31-16 00:00:00", "11-7-16 00:00:00", "11-14-16 00:00:00", "11-21-16 00:00:00", "1-23-17 00:00:00", "1-30-17 00:00:00", "2-06-17 00:00:00", "3-13-17 00:00:00", "3-27-17 00:00:00", "6-19-17 00:00:00", "6-26-17 00:00:00"]
start_date = [pd.to_datetime(d) for d in start_date]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)
ndf = pd.DataFrame({'start':pd.to_datetime(start_date),'end':end_date}); ndf.head()
What I want are the values from another dataframe that fall within the weeks defined in ndf. My other dataframe looks like this:
dates = ["4-17-16 04:00:00", "4-16-16 19:30:00", "4-16-16 19:00:00", "2-24-16 09:00:00", "3-21-16 02:00:00", "3-18-16 10:00:00", "3-24-16 05:00:00", "3-11-16 00:00:00"]
df = pd.DataFrame(
{'timestamp': dates,
'value': np.random.randint(1,25,size=(8,))})
Now I want to create a new dataframe that sums the value column of df over the rows that fall between the dates in ndf. So I created this function:
def get_dates(x):
    # Select the df values between start and ending datetime.
    n = df[(df['timestamp']>ndf['start'])&(df['timestamp']<ndf['end'])]
    # Return sum of values
    return n.values[0],n['value'].sum()
I have also played around with this:
n = df[(df['timestamp']>ndf['start'])&(df['timestamp']<ndf['end'])]
But I get the error:
ValueError: Can only compare identically-labeled Series objects
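I am looking for someone to help me clean up my function so that it works, or to explain the error message above. Thanks!
Answer 0 (score: 2)
For your specific case, where the start and end dates form one continuous period, you may want to use something like this: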
def get_dates():
    # Select the df values between start and ending datetime.
    n = df[(df['timestamp'] > ndf['start'].min()) &
           (df['timestamp'] < ndf['end'].max())]
    # Return sum of values
    return n.values[0], n['value'].sum()
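As for your error, it means you are trying to compare, element by element, two Series of different lengths: ndf has 45 rows, while df has 1000 (8 in the sample shown above).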
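The error can be reproduced in isolation. Here is a minimal sketch with two small made-up Series (`a` and `b` are illustrative names, not taken from the question). Comparing two Series whose index labels differ raises the `ValueError`, whereas comparing against a single scalar such as `b.min()` is broadcast to every row.

```python
import pandas as pd

# Two hypothetical Series with different lengths, hence different index labels.
a = pd.Series(pd.to_datetime(["2016-03-01", "2016-03-08", "2016-03-15"]))
b = pd.Series(pd.to_datetime(["2016-03-05", "2016-03-10"]))

raised = False
message = ""
try:
    a > b  # element-wise comparison needs identical index labels
except ValueError as exc:
    raised = True
    message = str(exc)

# Comparing against one scalar works: it is applied to each row of `a`.
mask = a > b.min()
print(raised, message)
print(mask.tolist())
```

This is why the version that uses `.min()` and `.max()` runs: each side of the comparison is then one Series and one scalar.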
Edit: I am not sure there is a prettier solution for non-contiguous periods than iterating over both dataframes:
def get_dates():
    count = 0
    for index, values_row in df.iterrows():
        for _, time_deltas_row in ndf.iterrows():
            if time_deltas_row['start'] < values_row['timestamp'] < time_deltas_row['end']:
                count += 1
                continue
    return count
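One possible alternative to the nested loops, offered as a sketch rather than as the answerer's method, is to let NumPy broadcasting build a rows-by-intervals boolean mask and sum the values per interval. The small frames below are made-up stand-ins for the question's `df` and `ndf` with fixed values, and the names `inside`, `sums` and `result` are illustrative. It assumes the `timestamp` column has already been converted to datetimes.

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames, with fixed values.
ndf = pd.DataFrame(
    {"start": pd.to_datetime(["2016-02-22", "2016-03-21", "2016-04-11"])}
)
ndf["end"] = ndf["start"] + pd.Timedelta(days=7)

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-04-17 04:00", "2016-04-16 19:30", "2016-02-24 09:00",
        "2016-03-24 05:00", "2016-03-11 00:00",
    ]),
    "value": [5, 7, 3, 2, 11],
})

ts = df["timestamp"].to_numpy()[:, None]   # shape (n_rows, 1)
start = ndf["start"].to_numpy()[None, :]   # shape (1, n_intervals)
end = ndf["end"].to_numpy()[None, :]       # shape (1, n_intervals)

# inside[i, j] is True when row i lies strictly inside interval j.
inside = (ts > start) & (ts < end)

# Zero out the values outside each interval, then sum per interval.
sums = (inside * df["value"].to_numpy()[:, None]).sum(axis=0)
result = ndf.assign(value=sums)
print(result)
```

The mask holds one entry per pair of row and interval, so this trades memory for speed. That is fine for a few thousand rows and a few dozen intervals, but may not suit very large inputs.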
Answer 1 (score: 1)
If you want to group the data into evenly spaced time intervals, use resample.
df.set_index('timestamp').resample('w-mon', label='left').sum().reset_index()
This returns:
timestamp value
0 2016-02-22 22.0
1 2016-02-29 NaN
2 2016-03-07 13.0
3 2016-03-14 20.0
4 2016-03-21 9.0
5 2016-03-28 NaN
6 2016-04-04 NaN
7 2016-04-11 34.0
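Note that `resample` needs a datetime index, so the string timestamps have to be parsed first. Here is a self-contained sketch using the question's dates. The values 1 to 8 are placeholders I chose so the run is reproducible, and the explicit month-day-year `format` string is my reading of the data.

```python
import pandas as pd

dates = ["4-17-16 04:00:00", "4-16-16 19:30:00", "4-16-16 19:00:00",
         "2-24-16 09:00:00", "3-21-16 02:00:00", "3-18-16 10:00:00",
         "3-24-16 05:00:00", "3-11-16 00:00:00"]

df = pd.DataFrame({
    # Parse the strings; the format is an assumption (month-day-year).
    "timestamp": pd.to_datetime(dates, format="%m-%d-%y %H:%M:%S"),
    # Placeholder values instead of np.random.randint.
    "value": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Weekly bins anchored on Mondays, each labelled with the Monday that opens it.
weekly = (df.set_index("timestamp")
            .resample("W-MON", label="left")
            .sum()
            .reset_index())
print(weekly)
```

Depending on the pandas version, weeks without data may show up as 0 rather than NaN. This approach also covers every week between the first and last timestamp, so it only matches `ndf` where the weeks in `ndf` are contiguous.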