Question

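I tried to copy and paste content from a Word document (.docx) into a .txt file and have it read by the NLTK corpus reader to find the number of paragraphs. It returns nearly 30 paragraphs as a single paragraph. When I manually entered a line break in the .txt file, it returned 30 paragraphs.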

import nltk
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt")
print("Paragraphs =", len(corpusReader.paras()))

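1. Can PlaintextCorpusReader read .docx?
2. How can I preserve line breaks when copying and pasting from .docx to .txt?
3. Is there a way, using Python, to open the .txt file, find "?", "!", "." or "..." followed by some spaces (four of them), and automatically insert a line break there, as if pressing "Enter"?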

Edit 1.

I went the para_block_reader=read_line_block route, but it always reports one paragraph too many.

import nltk
from nltk.corpus.reader.util import read_line_block
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt", para_block_reader=read_line_block)
print("Paragraphs =", len(corpusReader.paras()))

Answer 1

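The source code for PlaintextCorpusReader is the first class defined on this page, and it is quite simple.

It has subcomponents; if you do not specify them in the constructor, it uses the NLTK defaults:

para_block_reader (default: read_blankline_block), which says how the document is broken up into paragraphs.
sent_tokenizer (default: English Punkt), which says how a paragraph is broken up into sentences.
word_tokenizer (default: WordPunctTokenizer()), which says how a sentence is broken up into tokens (words and symbols).

Note that NLTK defaults may change between versions. I believe the default word_tokenizer used to be the Penn tokenizer.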

Re 1

No, PlaintextCorpusReader cannot read .docx. It only reads plain text. I am sure you can find a Python library to convert it.
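
One way to do the conversion without any third-party package is to read the .docx directly, since a .docx file is a zip archive whose main text lives in word/document.xml. The following is only a minimal sketch under that assumption; the function names are made up for illustration, and a dedicated library (python-docx is one) would handle tables, text boxes and other edge cases more carefully.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def docx_to_plaintext(docx_path, txt_path):
    """Write the paragraphs to a text file, separated by blank lines."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write("\n\n".join(docx_paragraphs(docx_path)) + "\n")
```

Writing a blank line between paragraphs means the resulting .txt file should work with the default para_block_reader, for example after calling docx_to_plaintext("d.docx", "d.txt") (file names are placeholders).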

Re 2

Copying and pasting is off-topic for this site; try Super User. I suggest you go with option 1 and get a library to do the conversion.

Re 3

Yes, you can do a search and replace with a regex.

 import re
 def breakup(mystring):
     # Keep the punctuation (group 1) and replace the four spaces with a newline.
     return re.sub(r"(\.\.\.|[.!?]) {4}", r"\1\n", mystring)
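
Because the default paragraph reader (read_blankline_block) splits on blank lines, a single inserted newline may not be enough on its own. Below is a minimal sketch that inserts a blank line instead and keeps the punctuation; the helper name, the pattern name and the sample text are made up for illustration.

```python
import re

# A run of sentence-ending punctuation (., ! or ?) followed by four spaces,
# as described in the question.
BREAK_PATTERN = re.compile(r"([.!?]+) {4}")


def break_into_paragraphs(text):
    # Keep the punctuation (group 1) and replace the spaces with a blank line.
    return BREAK_PATTERN.sub(r"\1\n\n", text)


sample = "First paragraph.    Second one?!    Third...    Fourth."
print(break_into_paragraphs(sample))
```

The pattern only fires when the punctuation is followed by four spaces, so ordinary sentence breaks inside a paragraph (a single space after the full stop) are left alone.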

But you may prefer to swap out the para_block_reader or the sent_tokenizer instead.

Answer 2

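The plaintext corpus reader can only read plain-text files. There are Python libraries that can read docx, but that will not solve your problem, which is that Word separates paragraphs with a single newline, whereas plain-text documents traditionally treat a paragraph boundary as a blank line, i.e. two consecutive newlines. In other words, your export method does preserve the line breaks; there just are not enough of them.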

So there is a simple way to fix your text so that paragraphs are recognized without further ado: once you have written out your plain-text file (whether through Word's Save As... menu or by cutting and pasting), post-process it like this (add encoding= arguments as needed):

import re

with open("my_plaintext.txt") as oldfile:
    content = oldfile.read()
content = re.sub("\n", "\n\n", content)
with open("my_plaintext_fixed.txt", "w") as newfile:
    newfile.write(content)

You can now read my_plaintext_fixed.txt with the PlaintextCorpusReader, and everything will work as expected.
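
The effect of doubling the newlines can be checked without NLTK. The sketch below only approximates blank-line paragraph splitting with a regular expression; it is not NLTK's own read_blankline_block, and the function name and sample text are made up for illustration.

```python
import re


def count_blankline_paragraphs(text):
    # Split on blank lines and ignore empty leftovers (e.g. at the end).
    blocks = re.split(r"\n\s*\n", text)
    return len([block for block in blocks if block.strip()])


# Text as exported from Word: one newline after each paragraph.
exported = "First paragraph.\nSecond paragraph.\nThird paragraph.\n"
# The same text after the post-processing step: every newline doubled.
fixed = re.sub("\n", "\n\n", exported)

print(count_blankline_paragraphs(exported))  # 1
print(count_blankline_paragraphs(fixed))     # 3
```

Note that the substitution doubles every newline, so it assumes the exported file has exactly one newline per paragraph and no hard line wraps inside a paragraph.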
