I have a dataframe like this:
df = pd.DataFrame()
text secFlag
0 book 1
1 headings 1
2 chapter 1
3 one 1
4 page 0
5 one 0
6 text 0
7 chapter 1
8 two 1
9 page 0
10 two 0
11 text 0
12 page 0
13 three 0
10 text 0
11 chapter 1
12 three 1
13 something 0
I want to compute a cumulative sum so that I can label every page belonging to a particular chapter with a running index number.
**Desired output**
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 2
3 one 1 2
4 page 0 2
5 one 0 2
6 text 0 2
7 chapter 1 3
8 two 1 3
9 page 0 3
10 two 0 3
11 text 0 3
12 page 0 3
13 three 0 3
10 text 0 3
11 chapter 1 4
12 three 1 4
13 something 0 4
This is what I tried:
df['chapter'] = ((df['secFlag'].shift(-1) == 1)).cumsum()
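However, this does not give me the desired output, because the counter increases every time the value in secFlag is 1. Note that the text is split into individual words, and chapter headings usually span several words.
Could you suggest a simple way to do this? Thanks.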
Answer 0 (score: 1)
If only the first 1 of each consecutive run of 1s in secFlag should be flagged, a solution is:
df['chapter'] = ((df['secFlag'] == 1) & (df['secFlag'] != df['secFlag'].shift())).cumsum()
print (df)
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 1
3 one 1 1
4 page 0 1
5 one 0 1
6 text 0 1
7 chapter 1 2
8 two 1 2
9 page 0 2
10 two 0 2
11 text 0 2
12 page 0 2
13 three 0 2
10 text 0 2
11 chapter 1 3
12 three 1 3
13 something 0 3
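For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. It rebuilds the sample data by hand and assumes a default 0..17 index rather than the repeated index labels shown in the question; the list names `words` and `flags` are only illustrative.

```python
import pandas as pd

# Sample data retyped from the question (default RangeIndex is an assumption).
words = ["book", "headings", "chapter", "one", "page", "one", "text",
         "chapter", "two", "page", "two", "text", "page", "three",
         "text", "chapter", "three", "something"]
flags = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]
df = pd.DataFrame({"text": words, "secFlag": flags})

# A row starts a new chapter when secFlag is 1 and differs from the previous row.
# For the first row, shift() yields NaN, so the comparison is True there.
starts = (df["secFlag"] == 1) & (df["secFlag"] != df["secFlag"].shift())

# The running count of chapter starts is the chapter number.
df["chapter"] = starts.cumsum()
print(df)
```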
Details:
a = (df['secFlag'] == 1)
b = (df['secFlag'] != df['secFlag'].shift())
c = a & b
d = c.cumsum()
print (pd.concat([df,a,b,c,d],
axis=1,
keys=('orig','==1','!=shifted','chained by &','cumsum')))
orig ==1 !=shifted chained by & cumsum
text secFlag secFlag secFlag secFlag secFlag
0 book 1 True True True 1
1 headings 1 True False False 1
2 chapter 1 True False False 1
3 one 1 True False False 1
4 page 0 False True False 1
5 one 0 False False False 1
6 text 0 False False False 1
7 chapter 1 True True True 2
8 two 1 True False False 2
9 page 0 False True False 2
10 two 0 False False False 2
11 text 0 False False False 2
12 page 0 False False False 2
13 three 0 False False False 2
10 text 0 False False False 2
11 chapter 1 True True True 3
12 three 1 True False False 3
13 something 0 False True False 3
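As a rough comparison, the sketch below puts the attempt from the question next to the start-of-run mask, using only the secFlag values from the sample (retyped by hand, with an assumed default index). The variable names `attempt` and `fixed` are illustrative only.

```python
import pandas as pd

# secFlag values retyped from the question's sample data.
sec = pd.Series([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
                name="secFlag")

# Attempt from the question: True wherever the NEXT row is 1,
# so every word of a multi-word heading increments the counter.
attempt = (sec.shift(-1) == 1).cumsum()

# Start-of-run mask: True only where a run of 1s begins.
fixed = ((sec == 1) & (sec != sec.shift())).cumsum()

print(pd.concat([sec, attempt, fixed], axis=1,
                keys=["secFlag", "attempt", "fixed"]))
```

On this sample the attempt climbs to 7 while the start-of-run version stops at 3, one increment per block of consecutive 1s.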