Override of CorpusView.read_block() is not taken into account

Date: 2019-05-13 17:08:13

Tags: python-3.x override nltk subclass

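I want to use NLTK to process a bunch of text files, splitting them on specific keywords. I am therefore trying to "subclass StreamBackedCorpusView and override the read_block() method", as suggested by the documentation.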

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class CustomCorpusView(StreamBackedCorpusView):

    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    CorpusView = CustomCorpusView

However, my knowledge of inheritance is rusty, and my override does not seem to be taken into account.

The output of

corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.words())

is the same as the output of

corpus = PlaintextCorpusReader("/path/to/files", ".*")

print(corpus.words())

I guess I am missing something obvious, but what?

1 answer:

Answer 0 (score: 0)

The documentation actually suggests two ways of defining a custom corpus view:

  
      
  1. Call the StreamBackedCorpusView constructor and provide a block reader function via the block_reader argument.
  2. Subclass StreamBackedCorpusView and override the read_block() method.
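
How these two approaches interact may be why the subclass in the question has no effect. As far as I can tell from the NLTK source, PlaintextCorpusReader.words() always passes its own block reader to self.CorpusView(...), and the StreamBackedCorpusView constructor stores a supplied block_reader on the instance as read_block. If that holds for your NLTK version (treat it as an assumption and check the source), the instance attribute hides the overridden method. The sketch below reproduces that mechanism with stand-in classes; View, CustomView and word_block_reader are hypothetical names, not NLTK code.

```python
class View:
    """Stand-in for StreamBackedCorpusView (hypothetical, not NLTK code)."""

    def __init__(self, block_reader=None):
        # Assumption: a supplied block reader is stored on the instance
        # under the same name as the overridable method.
        if block_reader:
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("abstract method")


class CustomView(View):
    def read_block(self, stream):
        return ["from subclass override"]


def word_block_reader(stream):
    # Plays the role of the block reader that the corpus reader passes in.
    return ["from block_reader argument"]


# The instance attribute set in __init__ shadows the subclass method.
shadowed = CustomView(word_block_reader).read_block(None)
# Without a block_reader argument, the override is used.
overridden = CustomView().read_block(None)
print(shadowed, overridden)
```

Under that assumption, subclassing only helps if the view is constructed without a block_reader argument, which is not how words() constructs it.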

The documentation also indicates that the first approach is easier, and indeed I managed to get it working as follows:

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat

class CustomCorpusReader(PlaintextCorpusReader):

    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )


corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.custom())
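
The debugging block reader above always returns an empty list. Since the question is about splitting text on specific keywords, here is a minimal sketch of what a real block reader might look like: it takes a stream, reads one line, and returns the segments between occurrences of a keyword. KEYWORD and read_keyword_block are hypothetical names chosen for illustration, and io.StringIO stands in for the file stream that NLTK would supply.

```python
import io

KEYWORD = "STOP"  # hypothetical keyword, chosen only for illustration


def read_keyword_block(stream):
    """Read one line and return its non-empty segments split on KEYWORD."""
    line = stream.readline()
    return [part.strip() for part in line.split(KEYWORD) if part.strip()]


# io.StringIO stands in for the stream a corpus view would pass in.
stream = io.StringIO("alpha beta STOP gamma\nSTOP delta\n")
first = read_keyword_block(stream)
second = read_keyword_block(stream)
at_eof = read_keyword_block(stream)
print(first, second, at_eof)
```

To use it in place of _custom_read_block, it would need a leading self parameter (or be passed as a plain function) and should be tested against your own files.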