Question

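I have a pandas DataFrame with one column that contains text. Some rows of the DataFrame hold a newline character (\n). I would like to group all rows that lie between two newline rows. For example: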

text_column
this is text
this is a new line

here starts a new paragraph
new line of new paragraph

next paragraph
...

I tried to mark the newline rows with the following line:

txt["doc"]=txt.text.str.match('\n')

This command gives me a new column containing True/False, which is not what I want. I am looking for this result:

text_column                                              paragraph
this is text this is a new line                              1
here starts a new paragraph new line of new paragraph        2
next paragraph                                               3

I hope someone can help.

Thanks.

Answer 1

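If every blank row contains \n, flag those rows with str.match and take the cumulative sum of the flags with cumsum to get a running group key. Pass that key to groupby and aggregate with join, then strip any leading or trailing whitespace and add the new paragraph column: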

df = (txt.groupby(txt.text.str.match('\n').cumsum())['text']
         .agg(' '.join).str.strip().reset_index(drop=True).to_frame()
         .assign(paragraph = lambda x: range(1, len(x)+1)))

print (df)
                                                text  paragraph
0                    this is text this is a new line          1
1  here starts a new paragraph new line of new pa...          2
2                                     next paragraph          3
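
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same approach. The sample DataFrame is an assumption reconstructed from the question (separator rows hold only "\n"), and the intermediate names is_break and group_id are introduced here just to make the cumsum step visible:

```python
import pandas as pd

# Assumed sample data: paragraph breaks are rows that contain only "\n".
txt = pd.DataFrame({"text": [
    "this is text",
    "this is a new line",
    "\n",
    "here starts a new paragraph",
    "new line of new paragraph",
    "\n",
    "next paragraph",
]})

# True on separator rows, False elsewhere.
is_break = txt["text"].str.match("\n")

# Running total of the flags: 0, 0, 1, 1, 1, 2, 2.
# Each separator row starts a new group and is itself part of that group.
group_id = is_break.cumsum().rename("group")

df = (
    txt.groupby(group_id)["text"]
       .agg(" ".join)             # join all rows of a group with spaces
       .str.strip()               # drop the "\n" left over from the separator row
       .reset_index(drop=True)
       .to_frame()
       .assign(paragraph=lambda x: range(1, len(x) + 1))
)

print(df)
```

Note that this relies on the separator rows really containing "\n". If they are empty strings instead, the mask would probably need to be something like txt["text"].str.strip().eq(""), and the join would then leave no stray newline to strip.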

Group a pandas DataFrame by paragraph
