How can I make POS n-grams more efficient?

Date: 2014-10-17 14:21:59

Tags: python nlp svm

I'm doing SVM text classification using POS n-grams as features, but it takes me 2 hours just to finish the POS unigrams. I have 5000 texts, each with 300 words. Here is my code:

import nltk

def posNgrams(s, n):
    '''Calculate POS n-grams and return a dictionary of their counts'''
    text = nltk.word_tokenize(s)
    text_tags = nltk.pos_tag(text)  # the tagging step dominates the running time
    taglist = []
    output = {}
    for item in text_tags:
        taglist.append(item[1])  # keep only the POS tag from each (word, tag) pair
    for i in xrange(len(taglist)-n+1):
        g = ' '.join(taglist[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output
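
For example, calling it on a short string gives something like this (the exact tags depend on the NLTK tagger version, so this output is only illustrative):

print posNgrams("The quick brown fox jumps over the lazy dog", 2)
# e.g. {'DT JJ': 2, 'JJ NN': 2, 'JJ JJ': 1, 'NN VBZ': 1, 'VBZ IN': 1, 'IN DT': 1}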

I tried the same approach for character n-grams and it only took a few minutes. Can you tell me how to make my POS n-grams faster?

1 Answer:

Answer 0 (score: 1)

Using a server with these specs from inxi -C:

CPU(s): 2 Hexa core Intel Xeon CPU E5-2430 v2s (-HT-MCP-SMP-) cache: 30720 KB flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) 
Clock Speeds: 1: 2500.036 MHz

Normally, the canonical answer would be to tag in batch with pos_tag_sents, but it doesn't look like that is any faster.
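
In case it helps, here is a minimal sketch of that canonical call (my illustration, not from the original answer; pos_tag_sents expects a list of already word-tokenized sentences):

from nltk import pos_tag_sents, word_tokenize

sents = ["This is one sentence.", "This is another."]
tagged = pos_tag_sents([word_tokenize(s) for s in sents])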

Let's profile the steps leading up to the POS tagging (using only 1 core):

import time

from nltk.corpus import brown
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk import pos_tag_sents

# Load brown corpus
start = time.time()
brown_corpus = brown.raw()
loading_time = time.time() - start
print "Loading brown corpus took",  loading_time

# Sentence tokenizing corpus
start = time.time()
brown_sents = sent_tokenize(brown_corpus)
sent_time = time.time() - start
print "Sentence tokenizing corpus took", sent_time


# Word tokenizing corpus
start = time.time()
brown_words = [word_tokenize(i) for i in brown_sents]
word_time = time.time() - start
print "Word tokenizing corpus took", word_time

# Loading, sent_tokenize, word_tokenize all together.
start = time.time()
brown_words = [word_tokenize(s) for s in sent_tokenize(brown.raw())]
tokenize_time = time.time() - start
print "Loading and tokenizing corpus took", tokenize_time

# POS tagging one sentence at a time.
start = time.time()
brown_tagged = [pos_tag(word_tokenize(s)) for s in sent_tokenize(brown.raw())]
tagging_time = time.time() - start
print "Tagging sentence by sentence took", tagging_time


# Tagging in batch with pos_tag_sents.
start = time.time()
brown_tagged = pos_tag_sents([word_tokenize(s) for s in sent_tokenize(brown.raw())])
tagging_time = time.time() - start
print "Tagging sentences by batch took", tagging_time

[OUT]:

Loading brown corpus took 0.154870033264
Sentence tokenizing corpus took 3.77206301689
Word tokenizing corpus took 13.982845068
Loading and tokenizing corpus took 17.8847839832
Tagging sentence by sentence took 1114.65085101
Tagging sentences by batch took 1104.63432097

Note: pos_tag_sents was called batch_pos_tag in versions before NLTK 3.0.
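
If your code has to run on both sides of that rename, a small compatibility shim like this sketch would work:

try:
    from nltk import pos_tag_sents  # NLTK >= 3.0
except ImportError:
    from nltk import batch_pos_tag as pos_tag_sents  # older NLTK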

In conclusion, I think you need to consider other POS taggers to preprocess your data, or you will have to use threading to handle the POS tagging.
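
Note that CPython's GIL keeps CPU-bound threads from running in parallel, so multiprocessing is usually the more effective route for tagging. A minimal sketch of that idea (my addition, not from the original answer; the worker count of 4 is an arbitrary assumption):

from multiprocessing import Pool

from nltk import pos_tag, word_tokenize, sent_tokenize
from nltk.corpus import brown

def tag_sentence(tokens):
    '''Worker: POS-tag one word-tokenized sentence.'''
    return pos_tag(tokens)

if __name__ == '__main__':
    sents = [word_tokenize(s) for s in sent_tokenize(brown.raw())]
    pool = Pool(processes=4)  # assumption: 4 workers; match your core count
    tagged = pool.map(tag_sentence, sents)
    pool.close()
    pool.join()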