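How can I keep line breaks, tabs, etc. in a text? Currently I can remove extra whitespace from a text document, but doing so also removes \n, \t, Unicode whitespace characters, and so on.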
import re
text = 'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'
text = re.sub(r'\s+', ' ', text).strip()
print(text)
print(type(text))
I also tried this, but it did not help.
import textwrap
textwrap.wrap(text,80,replace_whitespace=True)
Current output:
Hello world I wrote some random text here . I am trying to remove extra whitespace but keep line breaks, tabs, ...etc
<class 'str'>
Desired output:
Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc
Answer 0 (score: 5)
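You told the regular expression to match all whitespace, not just spaces. If you only want to match spaces, don't use \s; use an actual space character: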
text = re.sub(' +', ' ', text).strip()
Demo:
>>> import re
>>> text = 'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'
>>> re.sub(' +', ' ', text).strip()
'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'
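The sample string as shown here contains only single spaces, so the demo output looks the same as its input. A minimal sketch with deliberately repeated spaces makes the effect easier to see. The helper name `collapse_spaces` and the sample string are made up for illustration and are not part of the original answer.

```python
import re

def collapse_spaces(text):
    """Collapse runs of ordinary spaces; leave newlines and tabs in place."""
    # ' +' matches only the space character (U+0020), so tabs and
    # newlines inside the text are never replaced.
    # Note: .strip() still trims any whitespace at the very start and end.
    return re.sub(' +', ' ', text).strip()

# Hypothetical sample with double and triple spaces.
sample = 'Hello   world \n  I wrote \t \n\n.  done'
result = collapse_spaces(sample)
print(repr(result))
```

If leading or trailing newlines matter, dropping the `.strip()` call (or using `.strip(' ')`) would keep them.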
From the Regular Expression Syntax section of the re module documentation, on the \s sequence:
\s
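Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.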
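The difference the documentation describes can be checked with a short sketch. The sample string below is hypothetical; it contains a no-break space (U+00A0) and a tab so that the Unicode and ASCII behaviours of \s can be compared.

```python
import re

# Hypothetical sample: a no-break space after the colon, then a tab.
nbsp_text = 'price:\u00a0100\teach'

# Default (Unicode) matching for str patterns: \s covers the no-break space too.
unicode_result = re.sub(r'\s', '_', nbsp_text)

# With re.ASCII, \s is limited to [ \t\n\r\f\v], so U+00A0 is left alone.
ascii_result = re.sub(r'\s', '_', nbsp_text, flags=re.ASCII)

print(repr(unicode_result))
print(repr(ascii_result))
```

Either way, \s still matches \n and \t, which is why it cannot be used when those characters need to survive.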
Answer 1 (score: 0)
You can use re.split and join:
>>> ' '.join(re.split(r'[ ]{2,}', text))
'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'
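The key element is the regex [ ]{2,}, which splits on runs of actual ' ' space characters that are two or more spaces long. Single spaces, tabs, and newlines are left alone.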
You can use the same regex with re.sub:
>>> re.sub(r'[ ]{2,}', ' ', text)
'Hello world \n I wrote some random text here \t \n\n. I am trying to remove extra whitespace but keep line breaks, tabs, ...etc'
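As a quick check that the two approaches agree, here is a small sketch on a made-up string that mixes single, double, and triple spaces with a newline and a tab. The variable names are illustrative only.

```python
import re

# Hypothetical sample: one, two, and three spaces, then a newline and a tab.
sample = 'a b  c   d\n\te'

# Split on runs of two or more spaces, then rejoin with a single space.
via_split = ' '.join(re.split(r'[ ]{2,}', sample))

# Replace each run of two or more spaces with a single space directly.
via_sub = re.sub(r'[ ]{2,}', ' ', sample)

print(repr(via_split))
print(repr(via_sub))
```

Both expressions give the same string here; re.sub does it in one step without building an intermediate list.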