Using cumsum to find unique chapters

Time: 2018-08-04 08:22:30

Tags: python pandas cumsum

I have a data frame like this:

df = pd.DataFrame()

         text  secFlag
0        book        1
1    headings        1
2     chapter        1
3         one        1
4        page        0
5         one        0
6        text        0
7     chapter        1
8         two        1
9        page        0
10        two        0
11       text        0
12       page        0
13      three        0
10       text        0
11    chapter        1
12      three        1
13  something        0
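For anyone who wants to reproduce the problem, the frame above can be rebuilt roughly as follows. This is only a sketch: the names `words` and `flags` are illustrative, and it assumes a default 0..17 index, whereas the printed index repeats the labels 10 to 13.

```python
import pandas as pd

# Values copied row by row from the table above.
# Assumption: a plain RangeIndex (0..17) is used instead of the printed
# index, which repeats the labels 10-13.
words = ["book", "headings", "chapter", "one", "page", "one", "text",
         "chapter", "two", "page", "two", "text", "page", "three", "text",
         "chapter", "three", "something"]
flags = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]

df = pd.DataFrame({"text": words, "secFlag": flags})
print(df)
```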

I want to compute a cumulative sum so that every page belonging to a given chapter is labelled with a running index number.

**Desired output**


         text  secFlag  chapter
0        book        1        1
1    headings        1        1
2     chapter        1        2
3         one        1        2
4        page        0        2
5         one        0        2
6        text        0        2
7     chapter        1        3
8         two        1        3
9        page        0        3
10        two        0        3
11       text        0        3
12       page        0        3
13      three        0        3
10       text        0        3
11    chapter        1        4
12      three        1        4
13  something        0        4

This is what I tried:

df['chapter'] = ((df['secFlag'].shift(-1) == 1)).cumsum()

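However, this does not give me the desired output: the counter goes up on every row whose following secFlag is 1, not once per chapter. Note that the text column holds one word per row, and a chapter heading usually spans several words (several consecutive rows flagged 1).

Could you suggest a simple way to do this? Thanks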
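The over-counting is easy to see on the secFlag column alone. The sketch below uses the flag values from the table in the question; `sec_flag` and `attempt` are illustrative names.

```python
import pandas as pd

# secFlag values copied from the question's table.
sec_flag = pd.Series([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
                     name="secFlag")

# The attempted expression: True wherever the NEXT row's flag is 1.
attempt = (sec_flag.shift(-1) == 1).cumsum()

# The counter climbs once per flagged row, not once per heading.
print(attempt.tolist())
# [1, 2, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7]
```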

1 Answer:

Answer 0: (score: 1)

If you need to flag only the first 1 in each consecutive run of secFlag, a solution is:

df['chapter'] = ((df['secFlag'] == 1) & (df['secFlag'] != df['secFlag'].shift())).cumsum()
print (df)
         text  secFlag  chapter
0        book        1        1
1    headings        1        1
2     chapter        1        1
3         one        1        1
4        page        0        1
5         one        0        1
6        text        0        1
7     chapter        1        2
8         two        1        2
9        page        0        2
10        two        0        2
11       text        0        2
12       page        0        2
13      three        0        2
10       text        0        2
11    chapter        1        3
12      three        1        3
13  something        0        3

Details

a = (df['secFlag'] == 1)
b = (df['secFlag'] != df['secFlag'].shift())
c = a & b
d = c.cumsum()

print (pd.concat([df,a,b,c,d], 
                 axis=1, 
                 keys=('orig','==1','!=shifted','chained by &','cumsum')))
         orig             ==1 !=shifted chained by &  cumsum
         text secFlag secFlag   secFlag      secFlag secFlag
0        book       1    True      True         True       1
1    headings       1    True     False        False       1
2     chapter       1    True     False        False       1
3         one       1    True     False        False       1
4        page       0   False      True        False       1
5         one       0   False     False        False       1
6        text       0   False     False        False       1
7     chapter       1    True      True         True       2
8         two       1    True     False        False       2
9        page       0   False      True        False       2
10        two       0   False     False        False       2
11       text       0   False     False        False       2
12       page       0   False     False        False       2
13      three       0   False     False        False       2
10       text       0   False     False        False       2
11    chapter       1    True      True         True       3
12      three       1    True     False        False       3
13  something       0   False      True        False       3
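Note that the result above puts the opening rows (book, headings, chapter, one) into a single group, while the desired output in the question starts a new number at each row whose text is chapter. The sketch below compares the two numberings. The second one rests on assumptions that are not in the source: every chapter begins with the literal word chapter on a flagged row, and the first row always opens group 1. The column names `chapter_by_run` and `chapter_by_word` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["book", "headings", "chapter", "one", "page", "one", "text",
             "chapter", "two", "page", "two", "text", "page", "three", "text",
             "chapter", "three", "something"],
    "secFlag": [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
})

# Approach from the answer: count the first row of every run of 1s.
run_start = (df["secFlag"] == 1) & (df["secFlag"] != df["secFlag"].shift())
df["chapter_by_run"] = run_start.cumsum()

# Assumed variant: a group starts at each flagged row whose text is
# "chapter", and the very first row is forced to start group 1.
word_start = df["text"].eq("chapter") & df["secFlag"].eq(1)
word_start.iloc[0] = True
df["chapter_by_word"] = word_start.cumsum()

print(df)
```

With the sample data, `chapter_by_run` reproduces the answer's 1/2/3 numbering and `chapter_by_word` reproduces the 1/2/3/4 numbering from the desired output.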