Adding a conditional rolling count when a pandas column contains mixed data

Time: 2019-07-23 18:51:31

Tags: python pandas dataframe

I have a CSV file that looks like this:

Timestamp       Surface_Data
8737.37         Maze_A
8737.42         Maze_A
8740.40         Phone_Surface
8743.23         Desktop_Surface
8765.26         Phone_Surface
8765.29         Maze_A
8765.30         Phone_Surface
8765.56         Maze_B
8766.16         Maze_B
8783.74         Maze_A
8793.20         Maze_A
8840.12         Phone_Surface
8840.40         Phone_Surface
8841.40         Maze_B

I want to add a column that counts the changes from Maze_A to Maze_B or from Maze_B to Maze_A. It should look like this:

Timestamp       Surface_Data         Maze_Count
8737.37         Maze_A               1
8737.42         Maze_A
8740.40         Phone_Surface
8743.23         Desktop_Surface
8765.26         Phone_Surface
8765.29         Maze_A
8765.30         Phone_Surface
8765.56         Maze_B               2
8766.16         Maze_B
8783.74         Maze_A               3
8793.20         Maze_A
8840.12         Phone_Surface
8840.40         Phone_Surface
8841.40         Maze_B               4

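I tried using cumsum() on the rows where the value in the "Surface_Data" column changes, but that counts every change, including changes to and from the other values, which I don't want. I want the count to increase only when a Maze_A or Maze_B value is encountered.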
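The asker's code is not shown, so the following is only a sketch of what such an attempt might look like, assuming it compared each row with the previous one and took a cumulative sum. The DataFrame is rebuilt by hand from the sample data above. It illustrates why the count runs too high: every switch to or from Phone_Surface and Desktop_Surface is counted as well.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Timestamp": [8737.37, 8737.42, 8740.40, 8743.23, 8765.26, 8765.29, 8765.30,
                  8765.56, 8766.16, 8783.74, 8793.20, 8840.12, 8840.40, 8841.40],
    "Surface_Data": ["Maze_A", "Maze_A", "Phone_Surface", "Desktop_Surface",
                     "Phone_Surface", "Maze_A", "Phone_Surface", "Maze_B", "Maze_B",
                     "Maze_A", "Maze_A", "Phone_Surface", "Phone_Surface", "Maze_B"],
})

# Assumed naive attempt: flag every row that differs from the previous row,
# then take a running total of those flags.
s = df["Surface_Data"]
naive_count = s.ne(s.shift()).cumsum()

# The running total ends at 10, whereas the desired Maze_Count ends at 4.
print(naive_count.tolist())
```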

3 Answers:

Answer 0: (score: 2)

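Use where, shift, and cumsum:

s = df.Surface_Data
# Keep only Maze_A / Maze_B and carry the last seen maze forward over other rows
c = s.where(s.str.match('^Maze_[AB]$')).ffill()
# True where the carried-forward maze differs from the previous row
d = c.ne(c.shift())

# Running count of changes, shown only on the rows where a change happens
df.assign(Maze_Count=d.cumsum().where(d, ''))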

    Timestamp     Surface_Data Maze_Count
0     8737.37           Maze_A          1
1     8737.42           Maze_A           
2     8740.40    Phone_Surface           
3     8743.23  Desktop_Surface           
4     8765.26    Phone_Surface           
5     8765.29           Maze_A           
6     8765.30    Phone_Surface           
7     8765.56           Maze_B          2
8     8766.16           Maze_B           
9     8783.74           Maze_A          3
10    8793.20           Maze_A           
11    8840.12    Phone_Surface           
12    8840.40    Phone_Surface           
13    8841.40           Maze_B          4
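One edge case is worth noting with this approach: if the data begins with rows that are not Maze_A or Maze_B, the forward-filled series starts with NaN, and because NaN compares as not equal to NaN, each of those leading rows would be flagged as a change. The sketch below wraps the same idea in a helper and guards against that case. The function name maze_change_count and the sample series are hypothetical, not from the answer, and the result uses the nullable Int64 dtype instead of empty strings so the column stays numeric.

```python
import pandas as pd

def maze_change_count(surface: pd.Series) -> pd.Series:
    """Number each switch between Maze_A and Maze_B; all other rows get <NA>."""
    # Keep only Maze_A / Maze_B and carry the last seen maze forward.
    last_maze = surface.where(surface.isin(["Maze_A", "Maze_B"])).ffill()
    # A change is a row whose carried-forward maze differs from the previous row's.
    # notna() excludes rows before the first maze, where NaN != NaN would be True.
    changed = last_maze.ne(last_maze.shift()) & last_maze.notna()
    # Running count of changes, kept only on the rows where a change happens.
    return changed.cumsum().where(changed).astype("Int64")

# Hypothetical input that starts with a non-maze row.
surface = pd.Series(["Phone_Surface", "Maze_A", "Maze_A", "Phone_Surface",
                     "Maze_A", "Maze_B", "Desktop_Surface", "Maze_A"])
counts = maze_change_count(surface)
print(counts)
```

With the question's DataFrame this could be used as df.assign(Maze_Count=maze_change_count(df['Surface_Data'])).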

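Answer 1: (score: 1)

One attempt:

import numpy as np

# Boolean mask of the rows whose value contains 'Maze'
c = df['Surface_Data'].str.contains('Maze')

# Among the Maze rows only: flag changes, blank out non-changes, then count
df['Maze_Count'] = df.loc[c, 'Surface_Data'].ne(df.loc[c, 'Surface_Data'].shift()
                                               ).astype(int).replace(0, np.nan).cumsum()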
    Timestamp     Surface_Data  Maze_Count
0     8737.37           Maze_A         1.0
1     8737.42           Maze_A         NaN
2     8740.40    Phone_Surface         NaN
3     8743.23  Desktop_Surface         NaN
4     8765.26    Phone_Surface         NaN
5     8765.29           Maze_A         NaN
6     8765.30    Phone_Surface         NaN
7     8765.56           Maze_B         2.0
8     8766.16           Maze_B         NaN
9     8783.74           Maze_A         3.0
10    8793.20           Maze_A         NaN
11    8840.12    Phone_Surface         NaN
12    8840.40    Phone_Surface         NaN
13    8841.40           Maze_B         4.0

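Answer 2: (score: 1)

You can also filter the DataFrame down to the 'Maze_A' and 'Maze_B' rows, use shift and cumsum to find the changes, then drop_duplicates to keep only the first row of each run, and finally assign the result back to the DataFrame, which aligns it on the index: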

x = df.loc[df['Surface_Data'].isin(['Maze_A','Maze_B']), 'Surface_Data']
df.assign(Maze_count=(x != x.shift()).cumsum().drop_duplicates())

Output:

    Timestamp     Surface_Data  Maze_count
0     8737.37           Maze_A         1.0
1     8737.42           Maze_A         NaN
2     8740.40    Phone_Surface         NaN
3     8743.23  Desktop_Surface         NaN
4     8765.26    Phone_Surface         NaN
5     8765.29           Maze_A         NaN
6     8765.30    Phone_Surface         NaN
7     8765.56           Maze_B         2.0
8     8766.16           Maze_B         NaN
9     8783.74           Maze_A         3.0
10    8793.20           Maze_A         NaN
11    8840.12    Phone_Surface         NaN
12    8840.40    Phone_Surface         NaN
13    8841.40           Maze_B         4.0
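As a quick sanity check, the three approaches above can be run side by side on the sample data. This is only a sketch: the variable names a0, a1, and a2 are introduced here for illustration, and each result is reindexed to the full index so that the rows can be compared directly.

```python
import numpy as np
import pandas as pd

# Surface_Data column copied from the question.
df = pd.DataFrame({
    "Surface_Data": ["Maze_A", "Maze_A", "Phone_Surface", "Desktop_Surface",
                     "Phone_Surface", "Maze_A", "Phone_Surface", "Maze_B", "Maze_B",
                     "Maze_A", "Maze_A", "Phone_Surface", "Phone_Surface", "Maze_B"],
})
s = df["Surface_Data"]

# Answer 0: mask non-maze rows, forward-fill, compare with the previous row.
c = s.where(s.str.match("^Maze_[AB]$")).ffill()
d = c.ne(c.shift())
a0 = d.cumsum().where(d)

# Answer 1: work on the maze rows only, turn non-changes into NaN, then count.
m = s[s.str.contains("Maze")]
a1 = m.ne(m.shift()).astype(int).replace(0, np.nan).cumsum().reindex(df.index)

# Answer 2: count runs among the maze rows and keep the first row of each run.
x = s[s.isin(["Maze_A", "Maze_B"])]
a2 = (x != x.shift()).cumsum().drop_duplicates().reindex(df.index)

# Each result holds a count only at the rows where the maze changes.
for name, result in [("answer 0", a0), ("answer 1", a1), ("answer 2", a2)]:
    print(name, result.dropna().to_dict())
```

Note that the approaches differ in what they treat as a maze row: str.contains('Maze') matches any label containing "Maze", while the regular expression and isin match exactly Maze_A and Maze_B. On this data the distinction makes no difference.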