My DataFrame looks like this:
      field1  field2  field3
time
t1         1       1       1
t2         1       1       0
t3         2       3       1
t4         3       3       0
t5         1       2       0
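The times are in yyyy-mm-dd hh:mm:ss format and currently serve as the DataFrame's index. field1 and field2 identify an item, so that the tuple (field1, field2) corresponds to a specific sensor somewhere in the world. field3 is the value of that sensor at the given time, and it is either 0 or 1.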
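I want to group the DataFrame by (field1, field2) and, for each sensor, sum the total time it spent at each value of field3. So if t1 = '2016-07-20 00:00:00' and t2 = '2016-07-20 00:01:00', and the current time is '2016-07-20 00:03:00', I would get a new DataFrame like this: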
        field3=0  field3=1
(1,1)   2 min     1 min
(2,3)   ...       ...
(3,3)   ...       ...
(1,2)   ...       ...
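I am assuming that from t1 to t2 the value of field3 is 1, and that from t2 onward it is 0, because (1,1) does not appear in the DataFrame again. The 1 min comes from t2 - t1 and the 2 min comes from current_time - t2. The 2 min and 1 min can be in any format (total minutes or seconds, a fraction of the time, and so on).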
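To make the arithmetic for sensor (1,1) concrete, here is a small sketch using the timestamps given above. The variable names time_at_one and time_at_zero are only illustrative.

```python
import pandas as pd

t1 = pd.Timestamp("2016-07-20 00:00:00")
t2 = pd.Timestamp("2016-07-20 00:01:00")
current_time = pd.Timestamp("2016-07-20 00:03:00")

# field3 is 1 from t1 until the next reading at t2
time_at_one = t2 - t1
# field3 is 0 from t2 until now, since (1,1) has no later reading
time_at_zero = current_time - t2

print(time_at_one, time_at_zero)
```

Subtracting two Timestamps yields a Timedelta, which can be converted with .total_seconds() if plain numbers are preferred.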
I have tried the following:
import pandas as pd
from collections import defaultdict, namedtuple

# so i can create a defaultdict(Field3) and save some logic
class Field3(object):
    def __init__(self):
        self.zero = pd.Timedelta('0 days')
        self.one = pd.Timedelta('0 days')

# used to map to field3 in a dictionary
Sensor = namedtuple('Sensor', 'field1 field2')

# the dataframe mentioned above
df = pd.DataFrame(...)

# iterate through each row of the dataframe and map from (field1,field2) to
# field3, adding time based on the value of field3 in the frame and the
# time difference between this row and the next
rows = list(df.iterrows())
sensor_to_field3 = defaultdict(Field3)
for (time, row), (next_time, _) in zip(rows, rows[1:]):
    sensor = Sensor(field1=row['field1'], field2=row['field2'])
    if row['field3']:
        sensor_to_field3[sensor].one += next_time - time
    else:
        sensor_to_field3[sensor].zero += next_time - time

# one row per sensor, one column per field3 value
sensor_to_times = {k: [v.zero, v.one] for k, v in sensor_to_field3.items()}
result = pd.DataFrame.from_dict(sensor_to_times, orient='index',
                                columns=['field3=0', 'field3=1'])
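It basically gets me what I want (although right now it only works when a single sensor appears in the whole table, and I would really rather not deal with that if there is a better approach).
I feel there must be a better way to do this. Something like a groupby on field1, field2, followed by summing the timedeltas according to field3 and the time index, but I don't know how to do that.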
Answer 0 (score: 0)
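I managed to get it working, in case anyone else runs into something similar. I'm still not sure whether it is optimal, but it feels better than what I was doing.
I changed the original DataFrame to include the time as a column and to use a plain integer index.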
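That reshaping step might look like the sketch below. The two sample rows are assumptions that mirror the layout in the question, not the real data.

```python
import pandas as pd

# Hypothetical sample frame indexed by time, as in the question.
df = pd.DataFrame(
    {"field1": [1, 1], "field2": [1, 1], "field3": [1, 0]},
    index=pd.DatetimeIndex(
        ["2016-07-20 00:00:00", "2016-07-20 00:01:00"], name="time"
    ),
)

# Move the 'time' index into a regular column; the new index is 0, 1, ...
df = df.reset_index()
print(df)
```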
from datetime import datetime
import pandas as pd

def create_time_deltas(dataframe, stop=None):
    # rows of a single sensor, in time order
    dataframe = dataframe.sort_values('time').copy()
    stop = stop if stop is not None else pd.to_datetime(datetime.now())
    # set each row's timedelta to the difference between the next row's time and its own;
    # the last row has no next row, so it runs until stop
    next_time = dataframe['time'].shift(-1).fillna(stop)
    dataframe['timedelta'] = next_time - dataframe['time']
    return dataframe

def group_by_field3_time(dataframe, start=None, stop=None):
    # optionally set time bounds on what to care about
    stop = stop if stop is not None else pd.to_datetime(datetime.now())
    recent = dataframe.loc[dataframe['time'] < stop]
    if start is not None:
        recent = recent.loc[recent['time'] > start]
    # group by sensor and add the timedelta column to each group
    groups = recent.groupby(['field1', 'field2'])
    by_td = pd.concat([create_time_deltas(group, stop) for _, group in groups])
    # sum the timedeltas for each (field1, field2, field3) triple, which can be used later
    by_oc = by_td.groupby(['field1', 'field2', 'field3'])['timedelta'].sum()
    return by_oc
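If anyone can come up with a better way to do this, I'm all ears, but this is certainly much nicer than creating dictionaries all over the place.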
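One possible alternative, not part of the original answer, is to skip the per-group function and use a grouped shift to look up each sensor's next reading. This is only a sketch: the function name time_per_value and its now parameter are made up for illustration, it assumes time is a regular column on a unique index, and only t1 and t2 come from the question (the other sample times are invented).

```python
import pandas as pd

def time_per_value(df, now):
    """Total time each (field1, field2) sensor spent at each field3 value."""
    df = df.sort_values("time")
    sensor = ["field1", "field2"]
    # Time of the same sensor's next reading; its last reading lasts until `now`.
    next_time = df.groupby(sensor)["time"].shift(-1).fillna(now)
    df = df.assign(timedelta=next_time - df["time"])
    totals = df.groupby(sensor + ["field3"])["timedelta"].sum()
    # One row per sensor, one column per field3 value; missing pairs count as zero.
    return totals.unstack("field3").fillna(pd.Timedelta(0))

# Sample data: t1 and t2 are from the question, the remaining times are assumed.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-07-20 00:00:00", "2016-07-20 00:01:00", "2016-07-20 00:01:30",
        "2016-07-20 00:02:00", "2016-07-20 00:02:30",
    ]),
    "field1": [1, 1, 2, 3, 1],
    "field2": [1, 1, 3, 3, 2],
    "field3": [1, 0, 1, 0, 0],
})

result = time_per_value(df, now=pd.Timestamp("2016-07-20 00:03:00"))
print(result)
```

With this sample, the row for sensor (1,1) shows 2 minutes under field3 = 0 and 1 minute under field3 = 1, matching the expected output in the question.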