Question

I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string, it looks fine:

file = open('11.txt','r+')
alice = file.read()
print(alice[:500])

The output is:

ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'

So she was considering in her own mind (as well as she could, for the
hot d

Now, when I split it into sentences (the assignment was specifically to do this by "splitting on periods", so it is a very simplified split), I get this:

>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']

Where do the extra "\n" characters come from, and how do I remove them?

Answer 1

If you want to replace each run of newlines with a single space, do this:

import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
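
As a small illustration of what this substitution does, here is a sketch on a hand-made list. The sample strings are made up for the example and are not read from the file, and the `strip()` call is an addition of mine: a run of newlines at the start of a sentence becomes a leading space, which you may want to trim.

```python
import re

# Made-up sample standing in for the result of alice.split('.')
sentences = [
    "CHAPTER I",
    " Down the Rabbit-Hole\n\nAlice was\nbeginning",
    "\n\nThere was nothing",
]

# Replace each run of newlines with one space, then trim both ends
new_sentences = [re.sub(r'\n+', ' ', s).strip() for s in sentences]

print(new_sentences)
```

Without the `strip()`, the second and third items would start with a space.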

Answer 2

You may not want to use a regular expression, but this is how I would do it:

import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))

This should replace every run of two or more '\n' with a single newline, so you still have line breaks, but no "extra" ones.

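If you want to avoid creating a new list and modify the existing one instead (credit to @gavriel and Andrew L: when I first posted my answer, I hadn't thought of using enumerate):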

import re
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)

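The extra newlines are not really extra. I mean that they are supposed to be there, and they are visible in the text in your question: the more '\n' there are, the more space you see between lines of text (i.e. one blank line between the chapter heading and the first paragraph, and several between the edition and the chapter heading).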
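
One way to check that the newlines were in the string all along is to compare printing a string with printing its `repr`. The sketch below uses a short made-up excerpt rather than the real file. Printing a list shows the `repr` of each element, which is why the '\n' characters only become visible once the text is in a list.

```python
# Made-up excerpt: a title line, a blank line, then an author line
text = "ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n"

print(text)        # newlines are rendered as actual line breaks
print(repr(text))  # newlines are shown as the escape sequence \n

# A list prints the repr of each element, so the \n escapes appear
parts = text.split('.')
print(parts)
```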

Answer 3

You can see where the \n characters come from with this small example:

alice = """ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'

So she was considering in her own mind (as well as she could, for the
hot d"""

print(len(alice.split(".")))   # number of pieces when splitting on periods
print(len(alice.split("\n")))  # number of pieces when splitting on newlines

It all depends on how you split the text. The example above gives this output:

3
19

That is, you get 3 substrings if you split the text on ., or 19 substrings if you split using \n as the separator. You can read more about str.split.

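In your case, you split the text on ., so the 3 substrings contain multiple newlines \n. To get rid of them, you can either split those substrings again or simply remove the newlines with str.replace.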
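
Both options can be sketched on a single substring. The `chunk` value below is a made-up stand-in for one of the pieces produced by splitting on periods, and the split-and-rejoin variant is one possible reading of "split those substrings again", not necessarily the only one.

```python
# Made-up substring like the ones produced by alice.split(".")
chunk = (" Down the Rabbit-Hole\n\nAlice was beginning to get very tired "
         "of sitting by her sister on the\nbank")

# Option 1: str.replace swaps every newline for a space
# (a blank line therefore leaves two consecutive spaces)
replaced = chunk.replace("\n", " ")

# Option 2: split on any whitespace and re-join with single spaces,
# which also collapses repeated spaces and trims the ends
rejoined = " ".join(chunk.split())

print(repr(replaced))
print(repr(rejoined))
```

The second form tends to be more convenient when the text mixes single line breaks and blank lines, because it normalizes all of them in one step.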

Answer 4

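This text uses newlines as well as periods to delimit sentences. The problem you run into is that simply replacing the newline characters with an empty string leaves words with no space between them. Before you split alice on '.', I would use something along the lines of @elethan's solution to replace every run of multiple newlines in alice with '.'. Then you can do alice.split('.'), and all the sentences separated by multiple newlines will be split appropriately, together with the sentences that were originally separated by '.'.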

Then your only remaining problem is the decimal point in the version number.
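
A minimal sketch of this idea, assuming a made-up excerpt in the style of the file's opening lines instead of the real file. Dropping the empty strings that appear when a period is followed by a blank line is my own addition and is not part of the suggestion above.

```python
import re

# Made-up excerpt in the style of the file's opening lines
alice = ("THE MILLENNIUM FULCRUM EDITION 3.0\n\n\n\n\n"
         "CHAPTER I. Down the Rabbit-Hole\n\n"
         "Alice was beginning to get very tired.\n\n"
         "So she was considering")

# Turn every run of two or more newlines into a period,
# so that a blank line acts as a sentence break
marked = re.sub(r'\n{2,}', '.', alice)

# Split on periods, trim whitespace, and drop the empty pieces
# left where a period was already followed by a blank line
sentences = [s.strip() for s in marked.split('.') if s.strip()]

print(sentences)
```

The edition number is still split into '...EDITION 3' and '0', which is the decimal-point problem mentioned above.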

Answer 5

with open('11.txt', 'r') as file:
    lines = file.read().split('\n')  # split on newlines instead of periods

Removing "\n" when printing sentences from a text file in Python
