Question

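I want to use NLTK to process a set of text files and split them on specific keywords. To do that, I tried to follow the documentation's suggestion to "subclass StreamBackedCorpusView and override the read_block() method".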

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class CustomCorpusView(StreamBackedCorpusView):

    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    CorpusView = CustomCorpusView

However, my knowledge of inheritance is rusty, and my override does not seem to be taken into account.

The output of

corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.words())

is the same as the output of

corpus = PlaintextCorpusReader("/path/to/files", ".*")

print(corpus.words())

I assume I am missing something obvious, but what?

Answer 1

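The documentation actually suggests two ways of defining a custom corpus view:

Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.

Subclass StreamBackedCorpusView, and override the read_block() method.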

It also suggests that the first way is easier, and indeed I managed to get it working as follows:

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat  # explicit import instead of a wildcard

class CustomCorpusReader(PlaintextCorpusReader):

    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )


corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.custom())
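
As for why the subclass in the question seems to be ignored, here is a possible explanation, as far as I can tell from the NLTK source: PlaintextCorpusReader.words() passes its own block reader to the CorpusView constructor, and the constructor stores that function on the instance, where it shadows any read_block() defined on a subclass. The sketch below reproduces only that mechanism. BaseView, CustomView and default_word_reader are hypothetical stand-ins, not NLTK names, so it runs without NLTK installed.

```python
class BaseView:
    """Simplified stand-in for StreamBackedCorpusView (not the real class)."""

    def __init__(self, block_reader=None):
        # Assumption: mirrors what the NLTK constructor appears to do.
        # A supplied block reader becomes an instance attribute, and instance
        # attributes win over methods defined on the class or its subclasses.
        if block_reader:
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("override read_block or pass block_reader")


class CustomView(BaseView):
    """Plays the role of CustomCorpusView from the question."""

    def read_block(self, stream):
        return ["custom"]


def default_word_reader(stream):
    """Stand-in for the reader that words() hands to the view constructor."""
    return ["default"]


# How words() would build the view: the passed-in reader shadows the override.
shadowed = CustomView(block_reader=default_word_reader)
# Built without a block reader, the subclass method is used.
honoured = CustomView()

print(shadowed.read_block(None))  # ['default']
print(honoured.read_block(None))  # ['custom']
```

If that reading is right, setting CorpusView on the reader is not enough on its own, because words() still supplies its own block reader. That would be consistent with the custom() method above working, since it passes the custom reader explicitly.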
