Based on column values

Time: 2016-07-20 16:39:14

Tags: python pandas

My dataframe looks like this:

        field1    field2    field3
time
  t1         1         1         1
  t2         1         1         0
  t3         2         3         1
  t4         3         3         0
  t5         1         2         0     

The time is in the form yyyy-mm-dd hh:mm:ss and currently serves as the index of the dataframe.
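For anyone who wants to reproduce this, a minimal sketch of such a frame follows. The timestamps are made-up placeholders, since the question does not give the real values of t1 through t5; only the field values are taken from the table above.

```python
import pandas as pd

# Placeholder timestamps (an assumption); the real t1..t5 are not given.
times = pd.to_datetime([
    "2016-07-20 00:00:00",
    "2016-07-20 00:01:00",
    "2016-07-20 00:01:30",
    "2016-07-20 00:02:00",
    "2016-07-20 00:02:30",
])

# Field values copied from the table; the time column becomes a DatetimeIndex.
df = pd.DataFrame(
    {
        "field1": [1, 1, 2, 3, 1],
        "field2": [1, 1, 3, 3, 2],
        "field3": [1, 0, 1, 0, 0],
    },
    index=pd.DatetimeIndex(times, name="time"),
)
```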

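field1 and field2 are used to identify an item, such that the tuple (field1, field2) corresponds to a specific sensor somewhere in the world. field3 is the value of that sensor at the given time, which is either 0 or 1.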

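I would like to group the dataframe by (field1, field2) and, for each sensor, sum the total time it spent at each value of field3. So if t1 = '2016-07-20 00:00:00', t2 = '2016-07-20 00:01:00', and the current time is '2016-07-20 00:03:00', I would have a new dataframe like this: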

            field3=0    field3=1
(1,1)          2 min       1 min
(2,3)            ...         ...
(3,3)            ...         ...  
(1,2)            ...         ...

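I am assuming that from t1 to t2 the value of field3 is 1, and that from t2 onward it is 0, because (1,1) does not appear again in the dataframe. The 1 min comes from t2 - t1 and the 2 min comes from current_time - t2.

The 2 min and 1 min can be in any format (total minutes/seconds, a fraction of the time, etc.).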
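To make the arithmetic concrete, here is a small sketch using the example timestamps from the question. The variable names are illustrative only, and the three output formats are just some of the options pandas offers.

```python
import pandas as pd

t1 = pd.Timestamp("2016-07-20 00:00:00")
t2 = pd.Timestamp("2016-07-20 00:01:00")
current_time = pd.Timestamp("2016-07-20 00:03:00")

time_at_one = t2 - t1              # field3 == 1 from t1 until t2
time_at_zero = current_time - t2   # field3 == 0 from t2 until the current time

# The same durations expressed in a few different formats.
seconds_at_one = time_at_one.total_seconds()               # total seconds
minutes_at_zero = time_at_zero / pd.Timedelta(minutes=1)   # total minutes
fraction_at_zero = time_at_zero / (current_time - t1)      # share of the observed time
```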

I have tried the following:

import pandas as pd
from collections import defaultdict, namedtuple

# so i can create a defaultdict(Field3) and save some logic
class Field3(object):
    def __init__(self):
        self.zero = pd.Timedelta('0 days')
        self.one = pd.Timedelta('0 days')

# used to map to field3 in a dictionary
Sensor = namedtuple('Sensor','field1 field2')

# the dataframe mentioned above
df = pd.DataFrame(...)

# iterate through each  row of the dataframe and map from (field1,field2) to
# field3, adding time based on the value of field3 in the frame and the 
# time difference between this row and the next
rows = list(df.iterrows())
sensor_to_field3 = defaultdict(Field3)
for i in range(len(rows) - 1):
    time, row = rows[i]
    next_time = rows[i + 1][0]
    sensor = Sensor(field1=row['field1'], field2=row['field2'])
    if row['field3']:
        sensor_to_field3[sensor].one += next_time - time
    else:
        sensor_to_field3[sensor].zero += next_time - time
sensor_to_field3 = {k: [v] for k, v in sensor_to_field3.items()}
result = pd.DataFrame(sensor_to_field3, index=[0])

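It basically gets me what I want (although it currently only works when there is a single sensor in the whole table, and I would rather not deal with that case if there is a better way to go about this).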

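I feel like there must be a better way to solve this. Something like a groupby on field1, field2 and then aggregating the timedeltas based on field3 and the time index, but I am not sure how to do that.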

1 answer:

Answer 0: (score: 0)

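Managed to get it, in case anyone else runs into something similar. Still not sure whether it is optimal, but it feels better than what I was doing.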

I changed the original dataframe to include the time as a column and to use a plain integer index.
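A minimal sketch of that change, assuming the index is named time as in the question; the two-row frame here is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical two-row frame indexed by time, standing in for the real data.
df = pd.DataFrame(
    {"field1": [1, 1], "field2": [1, 1], "field3": [1, 0]},
    index=pd.DatetimeIndex(
        ["2016-07-20 00:00:00", "2016-07-20 00:01:00"], name="time"
    ),
)

# Move the time index into an ordinary column; the index becomes 0, 1, ...
df = df.reset_index()
```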

from datetime import datetime

import pandas as pd

def create_time_deltas(dataframe):
    # called once per (field1, field2) group, so renumber the rows 0..n-1
    # (rows are assumed to be sorted by time)
    dataframe = dataframe.reset_index(drop=True)
    # add a timedelta column
    dataframe['timedelta'] = pd.Timedelta(minutes=0)
    # iterate over each row and set the timedelta to the difference of the next one and this one
    for i in dataframe.index[:-1]:
        dataframe.at[i, 'timedelta'] = dataframe.at[i + 1, 'time'] - dataframe.at[i, 'time']
    # set the last timedelta, which couldn't be set in the loop because there is no next row
    last = dataframe.index[-1]
    dataframe.at[last, 'timedelta'] = pd.to_datetime(datetime.now()) - dataframe.at[last, 'time']
    return dataframe

def group_by_field3_time(dataframe, start=None, stop=None):
    # optionally set time bounds on what to care about
    stop = stop or pd.to_datetime(datetime.now())
    in_bounds = dataframe['time'] < stop
    if start is not None:
        in_bounds &= start < dataframe['time']
    recent = dataframe.loc[in_bounds]
    # groupby and apply to create a new dataframe with the timedelta column
    # (field1 and field2 end up in the index of the result)
    by_td = recent.groupby(['field1', 'field2'])[['time', 'field3']].apply(create_time_deltas)
    # sum the timedeltas for each triple, which can be used later
    by_oc = by_td.groupby(['field1', 'field2', 'field3'])['timedelta'].sum()
    return by_oc

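If anyone can come up with a better way to do this, I am all ears, but this does feel a lot better than creating dictionaries all over the place.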
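One possible alternative, offered only as a sketch and not as the approach used above, avoids the explicit row loop by shifting the time column within each sensor group. It assumes the time is an ordinary column (not the index) and that a sensor keeps its value until its next reading, or until now for its last reading. The names time_per_value and now are hypothetical, and the sample rows are made up apart from the t1, t2 and current-time values given in the question.

```python
import pandas as pd

def time_per_value(df, now):
    """Total time each (field1, field2) sensor spent at each field3 value."""
    keys = ["field1", "field2"]
    df = df.sort_values("time")
    # Time of the next reading of the same sensor; a sensor's last reading
    # is assumed to last until `now`.
    next_time = df.groupby(keys)["time"].shift(-1).fillna(now)
    durations = df.assign(duration=next_time - df["time"])
    # One row per sensor, one column per field3 value (0 and 1).
    return (
        durations.groupby(keys + ["field3"])["duration"]
        .sum()
        .unstack("field3", fill_value=pd.Timedelta(0))
    )

# Made-up sample: sensor (1, 1) follows the example in the question.
sample = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-07-20 00:00:00",
        "2016-07-20 00:01:00",
        "2016-07-20 00:02:00",
    ]),
    "field1": [1, 1, 2],
    "field2": [1, 1, 3],
    "field3": [1, 0, 1],
})
result = time_per_value(sample, pd.Timestamp("2016-07-20 00:03:00"))
```

Here result is indexed by (field1, field2) with one column per field3 value, so result.loc[(1, 1), 0] holds the time sensor (1, 1) spent at 0.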