I'm using NLTK to POS-tag the German Wikipedia. The structure is very simple: one big list that contains each sentence as a list of (word, POS tag) tuples, for example:
[[(Word1,POS),(Word2,POS),...],[(Word1,POS),(Word2,POS),...],...]
Since Wikipedia is huge, I obviously can't keep the whole list in memory, so I need a way to save parts of it to disk. What is a good way to do this so that I can easily iterate over all the sentences and words from disk again?
Answer 0 (score: 1)
Use pickle, see https://wiki.python.org/moin/UsingPickle:
import io
import cPickle as pickle
from nltk import pos_tag
from nltk.corpus import brown
print brown.sents()
print
# Let's tag the first 10 sentences.
tagged_corpus = [pos_tag(i) for i in brown.sents()[:10]]
with io.open('brown.pos', 'wb') as fout:
    pickle.dump(tagged_corpus, fout)

with io.open('brown.pos', 'rb') as fin:
    loaded_corpus = pickle.load(fin)

for sent in loaded_corpus:
    print sent
    break
[OUT]:
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]
[(u'The', 'DT'), (u'Fulton', 'NNP'), (u'County', 'NNP'), (u'Grand', 'NNP'), (u'Jury', 'NNP'), (u'said', 'VBD'), (u'Friday', 'NNP'), (u'an', 'DT'), (u'investigation', 'NN'), (u'of', 'IN'), (u"Atlanta's", 'JJ'), (u'recent', 'JJ'), (u'primary', 'JJ'), (u'election', 'NN'), (u'produced', 'VBN'), (u'``', '``'), (u'no', 'DT'), (u'evidence', 'NN'), (u"''", "''"), (u'that', 'WDT'), (u'any', 'DT'), (u'irregularities', 'NNS'), (u'took', 'VBD'), (u'place', 'NN'), (u'.', '.')]
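Since the question is about a corpus too big to hold in memory, one variation on the pickle approach (a sketch, not part of the answer above) is to dump one sentence per pickle.dump call and read them back lazily; tagged_sentence_stream() and the filename are hypothetical placeholders:
import pickle

# Write one pickled sentence at a time, so the full corpus never sits in memory.
# tagged_sentence_stream() and 'wiki.pos.pkl' are placeholders for your own
# tagging loop and output path.
with open('wiki.pos.pkl', 'wb') as fout:
    for tagged_sent in tagged_sentence_stream():
        pickle.dump(tagged_sent, fout)

# Read the sentences back one at a time; pickle.load raises EOFError
# once the stream is exhausted.
def iter_tagged_sents(path):
    with open(path, 'rb') as fin:
        while True:
            try:
                yield pickle.load(fin)
            except EOFError:
                return

for tagged_sent in iter_tagged_sents('wiki.pos.pkl'):
    pass  # process one tagged sentence at a time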
Answer 1 (score: 1)
The right thing to do is to save the tagged corpus in the format that nltk's TaggedCorpusReader expects: join each word and its tag with a slash /, and write the tokens separated by spaces. That is, you end up with Word1/POS Word2/POS Word3/POS ...
For some reason, nltk doesn't provide a function that does this. There is a function that joins a single word with its tag, but it's hardly worth looking up, since it's just as easy to do the whole thing directly:
# tagged_sentences is your big list of [(word, tag), ...] sentences;
# 'corpus.pos' is just an example output filename
with open('corpus.pos', 'w') as outfile:
    for tagged_sent in tagged_sentences:
        text = " ".join(w + "/" + t for w, t in tagged_sent)
        outfile.write(text + "\n")
That's it. Later you can read your corpus back with TaggedCorpusReader and iterate over it in all the usual ways the NLTK provides (as tagged or untagged words, as tagged or untagged sentences).
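For example, a minimal read-back sketch, assuming the corpus.pos file from the snippet above sits in the current directory:
from nltk.corpus.reader import TaggedCorpusReader

# Point the reader at the directory and file written above;
# the default separator is '/', which matches the Word1/POS format.
reader = TaggedCorpusReader('.', ['corpus.pos'])

first_tagged = reader.tagged_sents()[0]  # first sentence as (word, tag) tuples
first_plain = reader.sents()[0]          # the same sentence, untagged
some_words = reader.tagged_words()[:5]   # first few (word, tag) pairs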