Has anyone written a categorized XML corpus reader for NLTK?
I'm working with the Annotated New York Times corpus, which is an XML corpus. I can read the files with XMLCorpusReader, but I'd like to use some of NLTK's category functionality. There's a nice tutorial on subclassing NLTK readers. I could go ahead and write this myself, but I was hoping to save some time if someone has already done it.
If not, I'll post what I write.
Answer 0 (score: 1)
Here is a categorized XML corpus reader for NLTK, based on this tutorial. It lets you use NLTK's category-based functionality on XML corpora such as the New York Times Annotated Corpus.
Name this file CategorizedXMLCorpusReader.py and import it as:
import imp
CatXMLReader = imp.load_source('CategorizedXMLCorpusReader','PATH_TO_THIS_FILE/CategorizedXMLCorpusReader.py')
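One caveat with the import recipe above: the imp module is deprecated and was removed in Python 3.12. On newer Pythons the same one-file load can be done with importlib.util; the helper name load_source below is just illustrative, a sketch rather than a drop-in NLTK API:

```python
# Sketch of an imp.load_source replacement using importlib (Python 3.5+).
# The function name load_source is ours; use whatever name you like.
import importlib.util

def load_source(name, path):
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# CatXMLReader = load_source('CategorizedXMLCorpusReader',
#                            'PATH_TO_THIS_FILE/CategorizedXMLCorpusReader.py')
```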
You can then use it like any other NLTK reader. For example:
CatXMLReader = CatXMLReader.CategorizedXMLCorpusReader('.../nltk_data/corpora/nytimes', file_ids, cat_file='PATH_TO_CATEGORIES_FILE')
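For reference, the cat_file that CategorizedCorpusReader expects maps each fileid to its categories, one file per line, whitespace-delimited by default (a cat_delimiter keyword can change the separator). A sketch with made-up fileids and category names:

```text
1815970.xml politics us
1815971.xml sports
1815972.xml politics world
```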
I'm still finding my way around NLTK, so any corrections or suggestions are welcome.
# Categorized XML Corpus Reader
import nltk
from nltk.corpus.reader import CategorizedCorpusReader, XMLCorpusReader

class CategorizedXMLCorpusReader(CategorizedCorpusReader, XMLCorpusReader):

    def __init__(self, *args, **kwargs):
        CategorizedCorpusReader.__init__(self, kwargs)
        XMLCorpusReader.__init__(self, *args, **kwargs)

    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids

    # All of the following methods call the corresponding function in
    # XMLCorpusReader with the value returned from _resolve(). We'll start
    # with the plain text methods.
    def raw(self, fileids=None, categories=None):
        return XMLCorpusReader.raw(self, self._resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        # XMLCorpusReader.words() works on one file at a time, so
        # concatenate the word lists for each file here.
        words = []
        fileids = self._resolve(fileids, categories)
        for fileid in fileids:
            words += XMLCorpusReader.words(self, fileid)
        return words

    # This returns a string of the text of the XML docs without any markup.
    def text(self, fileids=None, categories=None):
        fileids = self._resolve(fileids, categories)
        text = ""
        for fileid in fileids:
            # Element.iter() replaces the deprecated getiterator().
            for i in self.xml(fileid).iter():
                if i.text:
                    text += i.text
        return text

    # This returns all text for a specified XML field.
    def fieldtext(self, fileids=None, categories=None):
        # NEEDS TO BE WRITTEN
        return

    def sents(self, fileids=None, categories=None):
        # words() returns a list, but the Punkt tokenizer expects a
        # string, so join the words before tokenizing.
        text = ' '.join(self.words(fileids, categories))
        return nltk.PunktSentenceTokenizer().tokenize(text)

    def paras(self, fileids=None, categories=None):
        return CategorizedCorpusReader.paras(self, self._resolve(fileids, categories))
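To see what _resolve() buys you without standing up NLTK and a real corpus, here is a minimal standalone sketch of the same dispatch: fileids pass through unchanged, categories are mapped to fileids, and supplying both is an error. The CAT_MAP dict and its fileids are made up for illustration:

```python
# Standalone sketch of the _resolve() dispatch above; no NLTK required.
CAT_MAP = {  # made-up fileid -> categories mapping
    'a.xml': ['politics'],
    'b.xml': ['sports'],
    'c.xml': ['politics', 'world'],
}

def resolve(fileids=None, categories=None, cat_map=CAT_MAP):
    if fileids is not None and categories is not None:
        raise ValueError('Specify fileids or categories, not both')
    if categories is not None:
        # analogue of self.fileids(categories)
        return sorted(f for f, cats in cat_map.items()
                      if set(cats) & set(categories))
    return fileids

print(resolve(categories=['politics']))  # → ['a.xml', 'c.xml']
print(resolve(fileids=['b.xml']))        # → ['b.xml']
```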
Answer 1 (score: 0)
Sorry NAD, but posting this as a new question was the only way I could find to discuss this code. I was also trying to use categories with the words() method and ran into a small bug. Here: https://github.com/nltk/nltk/issues/250#issuecomment-5273102
Did you run into this problem before me? Also, have you made any further modifications that might get categories working? My email is on my profile page if you want to talk about it off SO :-)
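One easy-to-hit failure in the reader's words() method (possibly, though not certainly, the one the linked issue describes): call it with neither fileids nor categories and _resolve() returns None, so the for-loop over fileids raises a TypeError. A minimal standalone sketch of a guard, with FAKE_WORDS standing in for the reader's real files:

```python
# Sketch of a None guard for words(); FAKE_WORDS and the fallback
# fileid list are stand-ins for the reader's real corpus files.
FAKE_WORDS = {'a.xml': ['hello', 'world'], 'b.xml': ['again']}

def words(fileids=None):
    if fileids is None:              # guard: fall back to every fileid
        fileids = sorted(FAKE_WORDS)
    out = []
    for fileid in fileids:
        out += FAKE_WORDS[fileid]
    return out

print(words())  # → ['hello', 'world', 'again']
```

In the real class the fallback would be the reader's own fileids() rather than a hard-coded list.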