Question

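How can I keep line breaks, tabs, etc. in a text? Currently I can remove the extra whitespace in a text document, but doing so also removes \n, \t, Unicode whitespace, and so on.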

import re

text = 'Hello world \n I wrote some random    text    here \t \n\n. I am trying      to remove extra whitespace but keep line breaks, tabs, ...etc'
text = re.sub(r'\s+', ' ', text).strip()
print(text)
print(type(text))

I tried this, but it did not help.

import textwrap
textwrap.wrap(text,80,replace_whitespace=True)

Current output:

Hello world I wrote some random text here . I am trying to remove extra whitespace but keep line breaks, tabs, ...etc
<class 'str'>

Desired output:

Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc

Answer 1

You told the regex to match all whitespace, not just spaces. If you only want to match spaces, don't use \s; use an actual space character:

text = re.sub(' +', ' ', text).strip()

Demo:

>>> import re
>>> text = 'Hello world \n I wrote some random    text    here \t \n\n. I am trying      to remove extra whitespace but keep line breaks, tabs, ...etc'
>>> re.sub(' +', ' ', text).strip()
'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'
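
One detail worth noting about the call above: a bare .strip() trims every kind of whitespace from both ends, so a leading or trailing newline or tab would still be lost. A minimal sketch, assuming you want to keep those as well; the helper name collapse_spaces is hypothetical and not part of the original answer:

```python
import re

def collapse_spaces(text):
    # Hypothetical helper: collapse runs of the plain space character only,
    # so newlines and tabs inside the text are left untouched.
    collapsed = re.sub(' +', ' ', text)
    # strip(' ') trims only spaces at the ends; a bare strip() would also
    # remove leading/trailing newlines and tabs.
    return collapsed.strip(' ')

print(repr(collapse_spaces('  Hello    world \n\t ')))
# 'Hello world \n\t'
```

If trimming newlines at the ends is acceptable for your data, the plain .strip() from the answer is simpler.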

From the Regular Expression Syntax section of the re module documentation, on the meaning of the \s sequence:

\s

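Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.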
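
To make the documentation excerpt concrete, here is a small sketch comparing \s with and without the ASCII flag. The sample string is made up for illustration; U+00A0 is the non-breaking space.

```python
import re

# Made-up sample: two non-breaking spaces (U+00A0), then space, tab, space.
sample = 'a\u00a0\u00a0b \t c'

# Default (Unicode) mode: \s also matches the non-breaking spaces.
unicode_mode = re.sub(r'\s+', ' ', sample)

# ASCII mode: \s matches only [ \t\n\r\f\v], so U+00A0 is left in place.
ascii_mode = re.sub(r'\s+', ' ', sample, flags=re.ASCII)

print(repr(unicode_mode))  # 'a b c'
print('\u00a0' in ascii_mode)  # True
```

Either way, \s still swallows \n and \t, which is why the answer switches to a literal space.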

Answer 2

You can use re.split and join:

>>> ' '.join(re.split(r'[ ]{2,}', text))
'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'

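The key element is the regex [ ]{2,}, which splits on runs of literal ' ' space characters that are two or more spaces long. Single spaces are left alone, which is fine because they would only be replaced by a single space anyway.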

You can use the same regex with re.sub:

>>> re.sub(r'[ ]{2,}', ' ', text)
'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'
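
As a quick sanity check, the sketch below runs the split/join form, the re.sub form, and the ' +' pattern from the other answer on a short made-up string and confirms that they agree. The variable names are illustrative only.

```python
import re

# Made-up sample with runs of spaces, a newline and a tab.
sample = 'a  b   c\n\td    e'

via_split = ' '.join(re.split(r'[ ]{2,}', sample))
via_sub = re.sub(r'[ ]{2,}', ' ', sample)
via_plus = re.sub(' +', ' ', sample)

# All three collapse space runs and leave \n and \t untouched.
print(repr(via_split))  # 'a b c\n\td e'
print(via_split == via_sub == via_plus)  # True
```

The re.sub form avoids building an intermediate list, so it is probably the more direct choice when no per-piece processing is needed.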

Remove extra spaces but keep \n, \t, etc.
