I have the following DataFrame:
+-------+------------+
| index | keep       |
+-------+------------+
|     0 | not useful |
|     1 | start_1    |
|     2 | useful     |
|     3 | end_1      |
|     4 | not useful |
|     5 | start_2    |
|     6 | useful     |
|     7 | useful     |
|     8 | end_2      |
+-------+------------+
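There are two pairs of marker strings (start_1, end_1, start_2, end_2) indicating that the rows between them are the only relevant rows in the data. So for the DataFrame below, the output DataFrame should consist only of the rows at index 2, 6, and 7 (2 lies between start_1 and end_1; 6 and 7 lie between start_2 and end_2).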
import pandas as pd

d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic / pandas way to solve this? Thanks.
Answer 0: (score: 2)
Here is one way to do it (split into a few steps for clarity). There may be others:
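# Mark each start row with +1 and each end row with -1; all other rows stay 0
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
# The running sum is 1 from a start row up to (but not including) its end row
df["in_section"] = df.sections.cumsum()
# Keep rows inside a section, dropping the start marker rows themselves
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]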
Output:
   index    keep  sections  in_section
2      2  useful         0           1
6      6  useful         0           1
7      7  useful         0           1
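For reference, the same cumulative-sum idea can be sketched without adding helper columns to the DataFrame. This is only a sketch under assumptions: `rows_between_markers` is a hypothetical helper name (not part of pandas), and it assumes the markers are well paired and never nested, so the running count is always 0 or 1.

```python
import pandas as pd


def rows_between_markers(df, col="keep", start="start", end="end"):
    """Hypothetical helper: return rows strictly between start/end marker rows.

    Assumes every start marker is followed by a matching end marker
    and that sections are not nested.
    """
    is_start = df[col].str.startswith(start)
    is_end = df[col].str.startswith(end)
    # +1 at each start marker, -1 at each end marker; the running sum
    # is 1 from a start row up to (not including) its end row.
    depth = is_start.astype(int).sub(is_end.astype(int)).cumsum()
    # Inside a section, but not the start marker row itself.
    return df[(depth == 1) & ~is_start]


d = {"keep": ["not useful", "start_1", "useful", "end_1", "not useful",
              "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)

res = rows_between_markers(df)
print(res.index.tolist())  # [2, 6, 7]
```

Because the intermediate Series are never assigned back, `df` keeps only its original `keep` column, which may be preferable when the helper columns are not needed afterwards.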