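I have a bunch of hard-wrapped text files (i.e. a newline roughly every 80 characters). I want to undo that and join the lines of each paragraph back together, while keeping the newlines that mark a new section or paragraph.
In other words, I would like to replace '\n' with ' ' if and only if it is not next to another '\n'.
The following Python code does what I want, but it is not very efficient, and I would rather do this with a regular expression and/or sed.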
s = open(filename, 'r').read()
p = s.split('\n\n') # split into paragraphs
p = [x.replace('\n', ' ') for x in p] # iterate all paragraphs, replace \n
s2 = '\n\n'.join(p) # join paragraphs back together
For example:
Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus.
should become:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
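Update
I timed the five Python methods below on a 5 MB text file. I was surprised to find that all three regex methods are an order of magnitude slower than the Python split/replace/join method.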
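import re
from itertools import groupby

def m1(s):
    p = s.split('\n\n')  # split into paragraphs
    p = [x.replace('\n', ' ') for x in p]  # iterate all paragraphs, replace \n
    r = '\n\n'.join(p)  # join paragraphs back together
    return r

def m2(s):
    # replace a newline that has no newline on either side
    r = re.sub(r"(?<!\n)\n(?!\n)", " ", s)
    return r

def m3(s):
    # r'' instead of Python 2's ur'' so this also runs on Python 3
    p = re.compile(r'(?<!^)\n(?=\S)', re.MULTILINE)
    r = re.sub(p, " ", s)
    return r

def m4(s):
    # as timed: s is a string, so groupby iterates characters rather than lines
    r = "".join(["".join(v) if k else " ".join(map(str.strip, v)) + "\n" for k, v in groupby(s, str.isspace)])
    return r

def repl(m):
    # a single newline becomes a space; longer runs are kept unchanged
    return (' ' if len(m.group(1)) == 1 else m.group(1)) + m.group(2)

def m5(s):
    r = re.sub(r'(\n+)(.)', repl, s)
    return r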
Results:
np.array( timeit.repeat('r=m1(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[4]: array([ 0.01343679, 0.0136183 , 0.0153013 , 0.0122381 , 0.01205051])
np.array( timeit.repeat('r=m2(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[5]: array([ 0.10881839, 0.108728 , 0.10904381, 0.10862441, 0.10867569])
np.array( timeit.repeat('r=m3(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[6]: array([ 0.1358021 , 0.1352592 , 0.13556101, 0.1357465 , 0.1354876 ])
np.array( timeit.repeat('r=m4(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[7]: array([ 2.51403842, 2.37821078, 2.4169096 , 2.56688828, 2.36240571])
np.array( timeit.repeat('r=m5(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[8]: array([ 0.16381941, 0.1616353 , 0.1620033 , 0.1617353 , 0.1615443 ])
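A self-contained way to reproduce a comparison of this kind is sketched below. The sample text, the repetition counts and the helper names `split_join` and `lookaround` are assumptions made for illustration; they are not the 5 MB file or the exact setup used above, and absolute numbers will depend on the machine and the Python version.

```python
import re
import timeit

# Hypothetical sample: three hard-wrapped paragraphs separated by blank lines.
SAMPLE = (
    "Lorem ipsum dolor sit amet,\nconsectetur adipiscing elit.\n\n"
    "Mauris vulputate nibh nec\nipsum mattis rutrum.\n\n"
    "Maecenas volutpat libero\nquis erat mollis."
)

def split_join(s):
    # Same idea as m1: split on blank lines, unwrap each paragraph, rejoin.
    return "\n\n".join(p.replace("\n", " ") for p in s.split("\n\n"))

SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")

def lookaround(s):
    # Same idea as m2: replace a newline that has no newline neighbour.
    return SINGLE_NEWLINE.sub(" ", s)

# The two approaches agree when paragraphs are separated by exactly one
# blank line; with longer runs of newlines their results can differ.
assert split_join(SAMPLE) == lookaround(SAMPLE)

if __name__ == "__main__":
    big = "\n\n".join([SAMPLE] * 500)  # arbitrary size, chosen for illustration
    for fn in (split_join, lookaround):
        best = min(timeit.repeat(lambda: fn(big), repeat=3, number=5)) / 5
        print(f"{fn.__name__}: {best:.5f} s per call")
```

Checking that the candidates produce the same output before timing them avoids comparing functions that do different work.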
Answer 0 (score: 0)
Use re.sub(); you will need negative lookbehind and lookahead assertions. If your input is large, this will not be very efficient.
Lookbehind:
(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
Lookahead:
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
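A small sketch of how the two assertions behave, using made-up strings:

```python
import re

# Negative lookahead: match 'Isaac ' only when 'Asimov' does not follow.
print(re.findall(r"Isaac (?!Asimov)", "Isaac Asimov, Isaac Newton"))
# ['Isaac ']  (only the one before 'Newton')

# Negative lookbehind plus lookahead around a newline: only a newline with
# no newline on either side matches, so blank lines are left alone.
pattern = re.compile(r"(?<!\n)\n(?!\n)")
print([m.start() for m in pattern.finditer("ab\ncd\n\nef")])
# [2]  (the two newlines at positions 5 and 6 are skipped)
```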
Here is an example:
>>> import re
>>> text = """Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus."""
>>> re.sub(r"(?<!\n)\n(?!\n)", " ", text)
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.\n\nMauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.\n\nMaecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.'
>>> print(_)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
Answer 1 (score: 0)
You can use awk, for example:
awk '{$1=$1}1' RS='' ORS='\n\n' OFS=' ' file
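Explanation:
{$1=$1}
looks as if it changes nothing. That is true for the field values, but the assignment makes awk rebuild the record using the new separators (see below).
1
always evaluates to true; because no action is specified, awk prints the whole current record.
RS=''
sets the input record separator. The empty string is a special value: records are separated by blank lines, and a newline always acts as a field separator, in addition to whatever FS specifies.
ORS='\n\n'
sets the output record separator to a blank line.
OFS=' '
sets the output field separator to a space.
Output: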
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
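For readers who think more easily in Python, the awk one-liner corresponds roughly to the sketch below. The function name `unwrap_paragraphs` is hypothetical, and the code only approximates awk's paragraph mode; it is not a faithful reimplementation of every edge case.

```python
import re

def unwrap_paragraphs(text):
    """Rough Python counterpart of awk's paragraph mode (RS='')."""
    # Ignore leading/trailing newlines, then split records on blank lines.
    records = re.split(r"\n{2,}", text.strip("\n"))
    # str.split() with no argument splits on any run of whitespace, newlines
    # included, much like awk's default field splitting; joining with a
    # single space plays the role of OFS=' '.
    # Each record is followed by a blank line, like ORS='\n\n'.
    return "".join(" ".join(rec.split()) + "\n\n" for rec in records if rec)

print(unwrap_paragraphs("Lorem ipsum\ndolor sit amet.\n\nMauris vulputate\nnibh nec ipsum.\n"), end="")
```

As with the awk version, every paragraph in the result is followed by a blank line, including the last one.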
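Answer 2 (score: 0)
You can use groupby, grouping on whitespace:
from itertools import groupby

with open("test.txt") as f:
    # k is True for runs of blank lines, False for runs of text lines
    print("".join(["".join(v) if k else " ".join(map(str.strip, v)) + "\n" for k, v in groupby(f, str.isspace)]))
Which gives you: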
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
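The one-liner is dense, so here is the same idea spelled out as a generator. The name `unwrap_lines` and the sample list are made up for illustration; the function accepts any iterable of lines, such as an open file.

```python
from itertools import groupby

def unwrap_lines(lines):
    """Yield one joined line per paragraph, keeping blank lines as they are."""
    for is_blank, group in groupby(lines, key=str.isspace):
        if is_blank:
            # A run of whitespace-only lines: pass it through unchanged.
            yield "".join(group)
        else:
            # A run of text lines: strip each one and join with single spaces.
            yield " ".join(line.strip() for line in group) + "\n"

sample = [
    "Lorem ipsum\n", "dolor sit amet.\n",
    "\n",
    "Mauris vulputate\n", "nibh nec ipsum.\n",
]
print("".join(unwrap_lines(sample)), end="")
```

Because it consumes lines lazily, this form never needs the whole file in memory at once, although that says nothing about its speed relative to the other approaches.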
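Answer 3 (score: 0)
I tried a regular expression in Python.
Assume the text variable contains your sample text:
import re

# match a newline that is not at the start of a line (i.e. not a blank line)
# and is followed by a non-whitespace character
p = re.compile(r'(?<!^)\n(?=\S)', re.MULTILINE)
result = re.sub(p, " ", text)
print(result)
It prints the following text, with each single newline replaced by a space.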
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
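One behavioural difference from the lookaround pattern in Answer 0 is worth illustrating: because of (?=\S), a newline followed by an indented line is left in place. The sample string below is made up for the purpose.

```python
import re

indented = "first line\n  continuation\n\nnext paragraph"

# Pattern from Answer 0: every lone newline is replaced, indentation or not.
print(repr(re.sub(r"(?<!\n)\n(?!\n)", " ", indented)))
# 'first line   continuation\n\nnext paragraph'

# Pattern from this answer: the newline before the indented line is kept,
# because the character after it is whitespace rather than \S.
print(repr(re.sub(r"(?<!^)\n(?=\S)", " ", indented, flags=re.MULTILINE)))
# 'first line\n  continuation\n\nnext paragraph'
```

Which behaviour is preferable depends on whether indented lines in the input are meant to start a new block.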
See the demo on regex101.
Answer 4 (score: 0)
Complex substitutions can sometimes be done by passing a function as the second argument to re.sub().
import re
ipsum = '''Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus.
'''
ipsum = re.sub(
    r'(\n+)(?=.)',
    # a single newline becomes a space; runs of newlines are kept as they are
    lambda m: ' ' if len(m.group(1)) == 1 else m.group(1),
    ipsum)
print(ipsum)