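I want to use NLTK to process a batch of text files and split them on specific keywords. To do that, I tried to follow what the documentation suggests: "subclass StreamBackedCorpusView and override the read_block() method".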
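from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class CustomCorpusView(StreamBackedCorpusView):
    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return []  # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    # Point the reader at the custom view class defined above
    CorpusView = CustomCorpusView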
However, my knowledge of inheritance is rusty, and my override does not seem to be taken into account.
The output of

corpus = CustomCorpusReader("/path/to/files/", ".*")
print(corpus.words())

is the same as the output of

corpus = PlaintextCorpusReader("/path/to/files", ".*")
print(corpus.words())
I suppose I am missing something obvious, but what?
Answer 0 (score: 0)
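The documentation actually suggests two ways of defining a custom corpus view:
- Call the StreamBackedCorpusView constructor and provide a block reader function via the block_reader argument.
- Subclass StreamBackedCorpusView and override the read_block() method.
It also notes that the first approach is easier, and I did manage to get it working as follows: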
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat

class CustomCorpusReader(PlaintextCorpusReader):
    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return []  # obviously this is only for debugging

    def custom(self, fileids=None):
        # Build one corpus view per file, passing the custom block reader
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )

corpus = CustomCorpusReader("/path/to/files/", ".*")
print(corpus.custom())
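As for why the subclass in the question appeared to be ignored: as far as I can tell from the NLTK source (this is my reading, not something the documentation states), PlaintextCorpusReader.words() always passes its own block reader to the view constructor, and StreamBackedCorpusView stores a supplied block_reader as an instance attribute named read_block, which shadows any read_block() defined on a subclass. The sketch below reproduces that mechanism in plain Python so it runs without NLTK. The names View, CustomView, FixedCustomView and read_word_block are hypothetical stand-ins, not NLTK classes.

```python
import io


class View:
    """Stand-in for StreamBackedCorpusView (only the relevant part of __init__)."""

    def __init__(self, fileid, block_reader=None, encoding="utf8"):
        self.fileid = fileid
        self.encoding = encoding
        if block_reader:
            # The instance attribute shadows read_block() defined on any class.
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("Abstract method")


def read_word_block(stream):
    """Stand-in for the block reader that the corpus reader passes in."""
    return stream.readline().split()


class CustomView(View):
    """Same shape as the subclass in the question."""

    def read_block(self, stream):
        return ["custom"]


class FixedCustomView(CustomView):
    """Discards the block_reader argument so the overridden method is used."""

    def __init__(self, fileid, block_reader=None, **kwargs):
        super().__init__(fileid, **kwargs)


# The reader supplies a block reader, so the override in CustomView is shadowed.
shadowed = CustomView("some_file", read_word_block)
print(shadowed.read_block(io.StringIO("a b\n")))  # ['a', 'b']

# Dropping block_reader in __init__ lets the overridden read_block() run.
fixed = FixedCustomView("some_file", read_word_block)
print(fixed.read_block(io.StringIO("a b\n")))  # ['custom']
```

If this reading is right, the subclassing route would need either a view whose constructor ignores block_reader, as in FixedCustomView, or a reader method that builds the view without passing one. That is probably why the block_reader approach above is the simpler of the two.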