I need to run nltk.pos_tag on a large dataset, and I need output similar to what the Stanford tagger provides.
For example, when running the following code:
import nltk
text = nltk.word_tokenize("We are going out.Just you and me.")
print(nltk.pos_tag(text))
the output is: [('We', 'PRP'), ('are', 'VBP'), ('going', 'VBG'), ('out.Just', 'IN'), ('you', 'PRP'), ('and', 'CC'), ('me', 'PRP'), ('.', '.')]
whereas I need it in this form:
We/PRP are/VBP going/VBG out.Just/NN you/PRP and/CC me/PRP ./.
I would prefer not to use string functions and to get this output directly, because the amount of text is very high and string handling would add a lot of time complexity to the processing.
Answer 0 (score: 3)
In short:
' '.join([word + '/' + pos for word, pos in tagged_sent])
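Spelled out as a runnable sketch, with a hard-coded tagged_sent standing in for the output of nltk.pos_tag so the snippet runs without NLTK installed:

```python
# tagged_sent is what nltk.pos_tag returns for the question's sentence:
# a list of (word, pos) tuples.
tagged_sent = [('We', 'PRP'), ('are', 'VBP'), ('going', 'VBG'),
               ('out.Just', 'IN'), ('you', 'PRP'), ('and', 'CC'),
               ('me', 'PRP'), ('.', '.')]

# Join each (word, pos) pair with '/' and separate pairs with spaces.
tagged_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
print(tagged_str)  # We/PRP are/VBP going/VBG out.Just/IN you/PRP and/CC me/PRP ./.
```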
In long:
I think you are overestimating the cost of using string functions to join strings; it really is not that expensive.
import time
from nltk.corpus import brown

tagged_corpus = brown.tagged_sents()

start = time.time()
with open('output.txt', 'w') as fout:
    for i, sent in enumerate(tagged_corpus):
        print(' '.join([word + '/' + pos for word, pos in sent]), end='\n', file=fout)
end = time.time() - start
print(i, end)
It took 2.955 seconds on my laptop to process all 57339 sentences in the Brown corpus.
[OUT]:
$ head -n1 output.txt
The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
But joining the word and POS with a string can cause trouble later, when you need to read your tagged output back, e.g.
>>> from nltk import pos_tag
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
>>> tagged_sent_str
'cat/NN //CD dog/NN'
>>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
[('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]
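One possible workaround (not part of the original answer) is to split on the last slash only with str.rsplit, which recovers the intended pairs even for the '//CD' token above:

```python
# Split each word/POS token on the *rightmost* '/' only, so a word that is
# itself a slash still parses into a (word, pos) pair.
tagged_sent_str = 'cat/NN //CD dog/NN'
parsed = [tuple(wordpos.rsplit('/', 1)) for wordpos in tagged_sent_str.split()]
print(parsed)  # [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
```

This only breaks down if the POS tag itself could contain a slash, which is not the case for the Penn Treebank tagset.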
If you want to save the tagged output and then read it back later, it is better to pickle the tagged output, e.g.
>>> import pickle
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> with open('tagged_sent.pkl', 'wb') as fout:
... pickle.dump(tagged_sent, fout)
...
>>> tagged_sent = None
>>> tagged_sent
>>> with open('tagged_sent.pkl', 'rb') as fin:
... tagged_sent = pickle.load(fin)
...
>>> tagged_sent
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
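If human-readable storage matters more than exact round-tripping, json is a possible alternative to pickle (a sketch, not part of the original answer), with the caveat that tuples come back as lists:

```python
import json

# JSON serializes each (word, pos) tuple as a two-element array, so the
# restored structure is a list of lists rather than a list of tuples.
tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

serialized = json.dumps(tagged_sent)
restored = json.loads(serialized)
print(restored)  # [['cat', 'NN'], ['/', 'CD'], ['dog', 'NN']]
```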