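I have Python read long lines, wrap them if they exceed x characters, and write them to a new file. I figured out how to make sure words don't get split, but I have a more specific problem: I don't want particular words to appear at the beginning of a line. After several hours of research, I realize I've been going about this the wrong way and need help.
This is my current code: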
with txtfile as infile, testfile as outfile:
    for line in infile:
        if len(line) > 80 and any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line, 96, replace_whitespace=False))
        elif len(line) > 80 and not any(word in line[77:] for word in connectives):
            outfile.write(textwrap.fill(line, 80, replace_whitespace=False))
        else:
            outfile.write(line)
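A little explanation of what I'm trying to do: right now it reads lines that are several hundred characters long, and if a line is over 80 characters, it wraps it to 80. I thought I would look at the last few characters of the line to see whether they contain any of the words I'm targeting and, if so, use a wider wrap for that line so the target word wouldn't get pushed to the next line. But I've realized this is the wrong way to think about it (maybe "stupid" is the better word), because the if statement checks the original line of several hundred characters. It never checks the subsequent lines that the wrapping produces. In the end, I can avoid breaking before the wrong word on the first line, but not on the following ones.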
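Since textwrap can avoid breaking up whole words if you don't want it to, I was hoping there was a way to tell it not to let certain words or characters be dropped to the next line.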
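Or maybe there is a way to read the wrapped content and, whenever a particular word shows up as the first word of a line, move it to the end of the previous line.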
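That second idea can be sketched roughly as follows. This is only an illustration under some assumptions: the function name `wrap_without_leading` and its parameters are made up for this example, trailing punctuation is stripped before comparing a word against the taboo set, and a line that receives a moved word is allowed to run past the requested width.

```python
import textwrap


def wrap_without_leading(text, width, taboo):
    """Wrap text, then pull taboo words that start a line up to the previous line.

    Hypothetical helper: the previous line may end up longer than `width`.
    """
    taboo = set(taboo)
    lines = textwrap.wrap(text, width)
    for i in range(1, len(lines)):
        words = lines[i].split()
        # Several taboo words in a row are all moved up, one at a time.
        while words and words[0].strip('.,;:!?') in taboo:
            lines[i - 1] += ' ' + words.pop(0)
        lines[i] = ' '.join(words)
    # A line can become empty if every word on it was moved up.
    return '\n'.join(line for line in lines if line)


if __name__ == '__main__':
    print(wrap_without_leading('aaa bbb ccc ddd', 7, ['ccc']))
```

The first line of the text is never touched, so a taboo word that opens the whole text stays where it is.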
Answer 0 (score: 1)
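You could probably hack textwrap to do what you want, but in the meantime here is a snippet that does it. The basic word-wrap code is adapted from the algorithm in the Wikipedia article titled Line wrap and word wrap.
When a word that isn't allowed to start the next line is encountered, it is simply appended to the current line (which technically makes that line too long). If you find that unacceptable, this at least gives you a code base for trying other approaches.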
import re


def textsplitter(text):
    """Yield (word, suffix) pairs, e.g. ('amet', ',') for 'amet,'."""
    # Scan the text argument rather than the global sample_text; \S* lets
    # one-character words match as well.
    for match_obj in re.finditer(r'(\w+)(\S*)', text):
        yield match_obj.groups()


def textwrapper(text, width=79, taboo=()):
    taboo = set(taboo)  # Words that can't be first on a line.
    result = []
    spaceleft = width
    for word, suffix in textsplitter(text):
        phrase = word + suffix  # Note suffix might be the empty string ''.
        if word in taboo:  # Can't be first, so just add it to the current line.
            result.append(phrase)
            spaceleft = 0  # Forces a linebreak before the next word.
        elif len(phrase) > spaceleft:
            result.append('\n' + phrase)  # Insert linebreak before word.
            spaceleft = width - len(phrase)
        else:
            result.append(phrase)
            spaceleft -= len(phrase) + 1
    return ' '.join(result)


sample_text = """\
Lorem ipsum dolor sit amet, consectetur adipiscing elit. In molestie lectus
nulla, at aliquam dolor suscipit ac. Mauris vitae purus non est vehicula dictum.
Integer varius diam tellus, quis cursus lacus sollicitudin sed. Nulla eu quam
nec felis egestas tristique eu placerat est. Praesent tincidunt libero in
aliquet euismod. Pellentesque eu odio mollis, consequat eros in, vestibulum
mauris. Aenean gravida dolor et ligula cursus laoreet.
"""

print('Wrapped with no taboo words:\n')
print(textwrapper(sample_text, 40))

print('\n'*2)

taboo = ['adipiscing', 'aliquam']  # Not allowed to appear at start of lines.
print('Wrapped again with taboo words {}:\n'.format(taboo))
print(textwrapper(sample_text, 40, taboo=taboo))
Output:
Wrapped with no taboo words:
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. In molestie lectus
nulla, at aliquam dolor suscipit ac.
Mauris vitae purus non est vehicula
dictum. Integer varius diam tellus, quis
cursus lacus sollicitudin sed. Nulla eu
quam nec felis egestas tristique eu
placerat est. Praesent tincidunt libero
in aliquet euismod. Pellentesque eu odio
mollis, consequat eros in, vestibulum
mauris. Aenean gravida dolor et ligula
cursus laoreet.
Wrapped again with taboo words ['adipiscing', 'aliquam']:
Lorem ipsum dolor sit amet, consectetur adipiscing
elit. In molestie lectus nulla, at aliquam
dolor suscipit ac. Mauris vitae purus non
est vehicula dictum. Integer varius diam
tellus, quis cursus lacus sollicitudin
sed. Nulla eu quam nec felis egestas
tristique eu placerat est. Praesent
tincidunt libero in aliquet euismod.
Pellentesque eu odio mollis, consequat
eros in, vestibulum mauris. Aenean
gravida dolor et ligula cursus laoreet.
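As for the "hack textwrap" route mentioned at the top, one possible sketch (not necessarily what a proper textwrap extension would look like) is to glue each taboo word to the word before it with a placeholder character that textwrap does not treat as whitespace, wrap, and then turn the placeholder back into a space. The name `wrap_glued` and the `GLUE` placeholder are assumptions made for this example; it assumes the placeholder never occurs in the input, matches taboo words case-sensitively, and cannot help when a taboo word is the very first word of the text.

```python
import re
import textwrap

GLUE = '\x00'  # Hypothetical placeholder; assumed not to occur in the text.


def wrap_glued(text, width, taboo):
    """Wrap text so that no taboo word is the first word of a wrapped line."""
    if not taboo:
        return textwrap.fill(text, width)
    # Match the whitespace that directly precedes a whole taboo word.
    pattern = r'\s+(?=(?:' + '|'.join(map(re.escape, taboo)) + r')\b)'
    glued = re.sub(pattern, GLUE, text)
    # break_long_words=False keeps a glued pair together even if it is
    # wider than `width` (such a line will then exceed the width).
    wrapped = textwrap.fill(glued, width, break_long_words=False)
    return wrapped.replace(GLUE, ' ')


if __name__ == '__main__':
    sample = ('Lorem ipsum dolor sit amet, consectetur adipiscing elit. '
              'In molestie lectus nulla, at aliquam dolor suscipit ac.')
    print(wrap_glued(sample, 40, ['adipiscing', 'aliquam']))
```

Unlike the snippet above, this moves the preceding word down together with the taboo word instead of letting the line run long, so lines stay within the width as long as each glued pair fits.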