Calculate max run length of groups of consecutive values in pandas

Time: 2018-10-20 15:41:17

Tags: pandas

Given a dataset like this:

import pandas as pd

values = (['motorway'] * 5) + (['link'] * 3) + (['motorway'] * 7)
df = pd.DataFrame.from_dict({
    'timestamp': pd.date_range(start='2018-1-1', end='2018-1-2', freq='s').tolist()[:len(values)],
    'road_type': values,
})
# Note: set_index returns a new frame; the result is not assigned, so df keeps its integer index.
df.set_index('timestamp')
# The first row has no predecessor, so its NaT difference is filled with a zero Timedelta.
df['delta_t'] = (df['timestamp'] - df['timestamp'].shift()).fillna(pd.Timedelta(0))

I would like to have the maximum sum of delta_t per group of consecutive road_type. Since delta_t is 1s in this example, I would like to find 7s for motorway and 3s for link. In practice there will be more road_types, and delta_t will vary.

Edit: The solution provided here looks similar, but it does not sum, and it does not select the largest run within each group.

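1 answer:

Answer 0: (score: 0)

Create a new column that labels each "run" of the same road type with a unique integer, then group by that column and sum, and finally take the maximum per road type: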

df['run'] = (df['road_type'] != df['road_type'].shift()).astype(int).cumsum()

df
             timestamp road_type  delta_t  run
0  2018-01-01 00:00:00  motorway 00:00:00    1
1  2018-01-01 00:00:01  motorway 00:00:01    1
2  2018-01-01 00:00:02  motorway 00:00:01    1
3  2018-01-01 00:00:03  motorway 00:00:01    1
4  2018-01-01 00:00:04  motorway 00:00:01    1
5  2018-01-01 00:00:05      link 00:00:01    2
6  2018-01-01 00:00:06      link 00:00:01    2
7  2018-01-01 00:00:07      link 00:00:01    2
8  2018-01-01 00:00:08  motorway 00:00:01    3
9  2018-01-01 00:00:09  motorway 00:00:01    3
10 2018-01-01 00:00:10  motorway 00:00:01    3
11 2018-01-01 00:00:11  motorway 00:00:01    3
12 2018-01-01 00:00:12  motorway 00:00:01    3
13 2018-01-01 00:00:13  motorway 00:00:01    3
14 2018-01-01 00:00:14  motorway 00:00:01    3
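
To make the labeling step easier to follow, here is a minimal sketch on a small, made-up Series (the data below is an assumption for illustration, not part of the question). Comparing each value with its predecessor marks the start of every run, and the cumulative sum of those marks gives each run its own integer:

```python
import pandas as pd

# Hypothetical toy data, shorter than the question's example.
road_type = pd.Series(['motorway', 'motorway', 'link', 'link', 'motorway'])

# True wherever the value differs from the previous row, i.e. at the start of a run.
# The first row is compared against NaN, so it always counts as a start.
starts = road_type != road_type.shift()

# The cumulative sum of the booleans gives every run its own integer label.
run = starts.cumsum()

print(starts.tolist())  # [True, False, True, False, True]
print(run.tolist())     # [1, 1, 2, 2, 3]
```

Note that the two motorway runs get different labels (1 and 3), which is what keeps them from being merged when grouping.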


df.groupby('run').agg({'road_type': 'first', 'delta_t': 'sum'}).reset_index(drop=True).groupby('road_type').max()

           delta_t
road_type         
link      00:00:03
motorway  00:00:07
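
The same idea can be wrapped in a small helper that does not add a column to the frame. This is only a sketch: the function name max_run_duration, its parameters, and the sample data with varying delta_t are assumptions made for illustration.

```python
import pandas as pd

def max_run_duration(df, key='road_type', value='delta_t'):
    """Return, for each distinct `key`, the largest per-run sum of `value`."""
    # Label consecutive runs of equal `key` values; rename the label so it
    # does not clash with the `key` column when grouping a second time.
    run = (df[key] != df[key].shift()).cumsum().rename('run')
    # Sum `value` within each run and remember which key the run belongs to.
    per_run = df.groupby(run).agg({key: 'first', value: 'sum'})
    # Keep only the largest run total for each key.
    return per_run.groupby(key)[value].max()

# Hypothetical data: two motorway runs (3s and 6s) around one link run (3s).
values = ['motorway'] * 2 + ['link'] * 3 + ['motorway'] * 2
seconds = [1, 2, 1, 1, 1, 5, 1]
df = pd.DataFrame({'road_type': values,
                   'delta_t': pd.to_timedelta(seconds, unit='s')})

result = max_run_duration(df)
print(result)
```

With this data the longer motorway run totals 6s and the single link run totals 3s, so those are the values returned per road type.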