Question

I have the following DataFrame:

+-------+------------+
| index |    keep    |
+-------+------------+
|     0 | not useful |
|     1 | start_1    |
|     2 | useful     |
|     3 | end_1      |
|     4 | not useful |
|     5 | start_2    |
|     6 | useful     |
|     7 | useful     |
|     8 | end_2      |
+-------+------------+

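Two pairs of marker strings (start_1/end_1 and start_2/end_2) indicate that the rows between them are the only relevant rows in the data. So for this DataFrame, the output should contain only the rows with index 2, 6 and 7 (2 lies between start_1 and end_1; 6 and 7 lie between start_2 and end_2).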

import pandas as pd

d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)

What is the most Pythonic / pandas way to solve this? Thanks.

Answer 1

Here is one way to do it (split into a few steps for clarity). There may be others:

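# Mark section boundaries: +1 on a start row, -1 on an end row, 0 elsewhere
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
# The running total is 1 from a start row up to (not including) its end row
df["in_section"] = df.sections.cumsum()
# Keep the rows inside a section, dropping the start rows themselves
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]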

Output (the DataFrame used here also carries the index column shown in the question's table):

   index    keep  sections  in_section
2      2  useful         0           1
6      6  useful         0           1
7      7  useful         0           1
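
If you would rather not add helper columns to df, the same cumulative-sum idea can be wrapped in a small function. This is only a sketch: the function name rows_between_markers and its parameters are made up for illustration, and it assumes every start marker is followed by a matching end marker, with no nesting or overlap.

```python
import pandas as pd


def rows_between_markers(df, col="keep", start="start", end="end"):
    """Return the rows strictly between each start/end marker pair.

    Hypothetical helper; assumes markers are well paired and never nested.
    """
    is_start = df[col].str.startswith(start)
    is_end = df[col].str.startswith(end)
    # +1 at each start marker and -1 at each end marker; the running total
    # is 1 from a start row up to (not including) the matching end row.
    depth = is_start.astype(int).sub(is_end.astype(int)).cumsum()
    # Keep rows inside a section, excluding the start rows themselves.
    return df[(depth == 1) & ~is_start]


d = {"keep": ["not useful", "start_1", "useful", "end_1", "not useful",
              "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)

result = rows_between_markers(df)
print(result.index.tolist())  # [2, 6, 7]
```

The original df keeps its single keep column, since the intermediate Series are never assigned back to it.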

Pandas: keep certain rows based on strings in other rows
