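How do I extract the maximum and minimum time from consecutive column values?

Time: 2018-07-14 02:07:40

Tags: pandas

I have the DataFrame below. I want to extract the maximum and minimum time for each run of consecutive values in a column. How can I do that?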

import pandas as pd
import numpy as np

raw_data = {
    'Time': [281.54385, 298.64380, 321.29645, 321.39640, 419.58545, 430.68540,
             533.96025, 580.37990, 590.85605, 634.06015, 724.16010, 750.26000,
             777.87955, 830.97945, 850.07940],
    'CF_A': [1, 1, 1, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0],
    'CF_B': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0],
    'CF_C': [0, 0, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 0, 0, 0],
}

data = pd.DataFrame(raw_data)

dataframe - Input (see picture)

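The values in each column occur in consecutive runs, and I want to build a new DataFrame that summarizes the times at which each run starts and ends.

The desired result is shown below.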

result (see picture)

1 answer:

Answer 0: (score: 0)

I suggest using the case values as the index, which avoids having several columns that all hold the same case values:

# select the columns whose names contain 'CF'
cols = data.filter(like='CF').columns
# one Series of summed durations per column
L = []
for col in cols:
    # label each run of consecutive equal values with its own group id
    s = data[col].ne(data[col].shift()).cumsum().rename('g')
    # group Time by (run id, case value)
    g = data.groupby([s, data[col]])['Time']
    # duration of each run: last time minus first time
    d = g.last() - g.first()
    # sum the durations per case value (second index level);
    # Series.sum(level=1) was removed in pandas 2.0, so group by level instead
    L.append(d.groupby(level=1).sum())

# join all Series side by side; the case values form the index
df = pd.concat(L, axis=1, keys=cols).fillna(0)
print(df)
        CF_A       CF_B       CF_C
0  181.48885  119.19985   89.29980
1   96.64840  202.86100  159.40395
2  116.19985    0.00000    0.09995
3    0.00000    0.00000  160.79445
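
The grouping in the loop relies on one idiom: comparing a column with its shifted copy and taking a cumulative sum, which gives every run of consecutive equal values its own id. The following minimal sketch isolates that step for the CF_A values from the question; the variable names `changed` and `run_id` are illustrative and do not appear in the answer.

```python
import pandas as pd

# Same CF_A values as in the question.
cf_a = pd.Series([1, 1, 1, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0], name='CF_A')

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it always counts as a change.
changed = cf_a.ne(cf_a.shift())

# A running count of the changes: rows in the same run share one id.
run_id = changed.cumsum().rename('g')

print(run_id.tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
```

Note that the two separate runs of 0 (and of 1) get different ids, which is why the answer groups by both the run id and the case value before summing per case.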

But if you really need the expected output:

# select the columns whose names contain 'CF'
df1 = data.filter(like='CF')
# all distinct case values across those columns (np.unique returns them sorted)
idx = np.unique(df1.values.ravel())
print(idx)

# one small DataFrame per column
L = []
for col in df1.columns:
    # label each run of consecutive equal values with its own group id
    s = data[col].ne(data[col].shift()).cumsum().rename('g')
    # group Time by (run id, case value)
    g = data.groupby([s, data[col]])['Time']
    # duration of each run: last time minus first time
    d = g.last() - g.first()
    # sum per case value (Series.sum(level=1) was removed in pandas 2.0),
    # then add the cases missing from this column with a duration of 0
    part = (d.groupby(level=1).sum()
             .rename_axis('Case')
             .reindex(idx, fill_value=0)
             .reset_index())
    # prefix the column names with the source column, e.g. CF_A_Case, CF_A_Time
    L.append(part.add_prefix(col + '_'))

# join all DataFrames side by side
df = pd.concat(L, axis=1)
print(df)
   CF_A_Case  CF_A_Time  CF_B_Case  CF_B_Time  CF_C_Case  CF_C_Time
0          0  181.48885          0  119.19985          0   89.29980
1          1   96.64840          1  202.86100          1  159.40395
2          2  116.19985          2    0.00000          2    0.09995
3          3    0.00000          3    0.00000          3  160.79445
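
Both outputs above report summed durations per case. Since the picture of the desired result is not available here, the following is only a sketch under the assumption that the question wants the first (minimum) and last (maximum) time of every individual run. It reuses the same run-labelling idiom for a single column; the names `run_id`, `runs`, `start`, `end` and `duration` are illustrative.

```python
import pandas as pd

raw_data = {
    'Time': [281.54385, 298.64380, 321.29645, 321.39640, 419.58545, 430.68540,
             533.96025, 580.37990, 590.85605, 634.06015, 724.16010, 750.26000,
             777.87955, 830.97945, 850.07940],
    'CF_A': [1, 1, 1, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0],
}
data = pd.DataFrame(raw_data)

col = 'CF_A'
# Label each run of consecutive equal values with its own id.
run_id = data[col].ne(data[col].shift()).cumsum().rename('run')

# One row per run: the case value plus the minimum and maximum time in that run.
runs = (data.groupby([run_id, data[col]])['Time']
            .agg(start='min', end='max')
            .reset_index())
# Assumed extra column: how long each run lasted.
runs['duration'] = runs['end'] - runs['start']

print(runs)
```

Summing `duration` per value of `CF_A` in this table gives the same figures as the `CF_A` column in the outputs above, so the two views are consistent; wrapping the body in a loop over the `CF` columns would cover `CF_B` and `CF_C` as well.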