Question

I am splitting the text para on spaces and preserving the newline characters \n with the following code:

from nltk import SpaceTokenizer
para="\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"
sent=SpaceTokenizer().tokenize(para)

print(sent) gives me the following:

['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

My goal is to get the following output:

['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']

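That is, I want to split 'comma,' into 'comma' and ',', 'period.' into 'period' and '.', and 'question?' into 'question' and '?', while preserving \n.

I have tried word_tokenize, which does split 'comma' from ',' and so on, but it does not preserve \n.

How can I split sent further while still preserving \n?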

Answer 1

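https://docs.python.org/3/library/re.html#re.split is probably what you want.

However, judging from your desired output, you will need to process the string further rather than apply a single function to it.

I would first replace every \n with a string such as new_line_goes_here, then split the string, and finally replace new_line_goes_here with \n again.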
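The placeholder idea might look roughly like the sketch below. It is only an illustration under some assumptions: the helper name tokenize_with_placeholder is hypothetical, the placeholder string is assumed never to occur in the real text, and a plain re.findall pattern stands in for whatever tokenizer you actually use. Each \n comes back as its own token, so the result is not identical to the list in the question.

```python
import re

# Assumed never to appear in the real text; it must match \w+ as a single token.
PLACEHOLDER = "new_line_goes_here"


def tokenize_with_placeholder(text):
    # 1. Protect newlines by swapping them for a word-like placeholder.
    protected = text.replace("\n", f" {PLACEHOLDER} ")
    # 2. Split into words and single punctuation marks; whitespace is dropped.
    tokens = re.findall(r"\w+|[^\w\s]", protected)
    # 3. Restore the newlines where the placeholder survived as a token.
    return ["\n" if tok == PLACEHOLDER else tok for tok in tokens]


print(tokenize_with_placeholder("comma, with period.\nthe new question?"))
```

Because whitespace is discarded in step 2, this variant loses the empty-string tokens that SpaceTokenizer produces for runs of spaces.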

Answer 2

Following @randy's suggestion, I looked at https://docs.python.org/3/library/re.html#re.split

import re
para = re.split(r'(\W+)', '\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*')
print(para)

Output (close to what I am looking for):

['', '\n[', 'STUFF', ']\n  ', 'comma', ',  ', 'with', ' ', 'period', '. ', 'the', ' ', 'new', ' ', 'question', '? \n\n', 'the', '\n  \n', 'line', '\n ', 'new', ' ', 'char', '*', '']
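To reach the exact list from the question, one possible approach is to keep the split on single spaces and then peel a trailing punctuation mark off each piece. The sketch below assumes that only a single trailing ',', '.' or '?' should be separated (which is why '[STUFF]' and 'char*' stay intact), and it uses str.split(" ") in place of SpaceTokenizer, which produces the same pieces for this input. The function name split_keep_newlines is made up for illustration.

```python
import re

para = "\n[STUFF]\n  comma,  with period. the new question? \n\nthe\n  \nline\n new char*"

# Assumption: only one trailing ',', '.' or '?' is split off a token.
# DOTALL lets the first group span embedded newlines.
TRAILING_PUNCT = re.compile(r"(.+?)([,.?])", re.DOTALL)


def split_keep_newlines(text):
    tokens = []
    # Splitting on a single space keeps '\n' inside tokens and yields ''
    # for consecutive spaces, like SpaceTokenizer does.
    for piece in text.split(" "):
        match = TRAILING_PUNCT.fullmatch(piece)
        if match:
            tokens.extend(match.groups())  # e.g. 'comma,' -> 'comma', ','
        else:
            tokens.append(piece)  # no trailing punctuation: keep as is
    return tokens


print(split_keep_newlines(para))
```

A piece that ends in a newline after its punctuation (for example 'end.\n') would not be split by this pattern; handling that case would need an extra rule.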

Further split text while preserving line breaks
