Question

I am trying to run a bigram analysis on the Enron email corpus:

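    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.tokenize.punkt import PunktWordTokenizer  # NLTK 2.x API

    bigram_measures = BigramAssocMeasures()
    words = []  # must exist before the loop appends to it
    for message in messages.find():  # messages: the collection holding the emails
        for sentence in nltk.tokenize.sent_tokenize(message["body"]):
            words.extend(PunktWordTokenizer().tokenize(sentence))
    finder = BigramCollocationFinder.from_words(words)
    print(finder.nbest(bigram_measures.pmi, 20))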

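However, when I look at `top`, I see one core saturated while the others sit idle. Is there any way to distribute the computation across all the other cores? (This is running on Google Compute Engine.)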

Output of top:

    Tasks: 117 total,   2 running, 115 sleeping,   0 stopped,   0 zombie
    %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu5  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu6  :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:   7369132 total,  5303352 used,  2065780 free,    68752 buffers
    KiB Swap:        0 total,        0 used,        0 free,  4747800 cached

Answer 1

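You don't really need NLTK for the whole job. Use it for tokenization, and write a simple parallel bigram-counting function yourself. You may want to consider the built-in map and reduce functions for this. The Unigram Frequency Calculation example explains how the two functions are used; you can extend it to count bigrams.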
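The map/reduce idea could be sketched roughly as follows. This is only an illustration under some assumptions: the built-in map runs in a single process, so the sketch substitutes multiprocessing.Pool.map to spread the work over the cores; a simple regular expression stands in for the NLTK tokenizer; the function names and sample data are made up for the example; and it assumes Python 3. Unlike the code in the question, it counts bigrams within each message only, so no pair spans two messages.

```python
import re
from collections import Counter
from functools import reduce
from multiprocessing import Pool

# Stand-in tokenizer (an assumption); an NLTK tokenizer could be used instead.
TOKEN_RE = re.compile(r"\w+")


def count_bigrams(text):
    """Map step: count adjacent word pairs in one message body."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(zip(tokens, tokens[1:]))


def merge_counts(left, right):
    """Reduce step: add one partial bigram count into another."""
    left.update(right)
    return left


def parallel_bigram_counts(texts, processes=None):
    """Count bigrams over many texts; Pool defaults to one worker per core."""
    with Pool(processes) as pool:
        partials = pool.map(count_bigrams, texts, chunksize=64)
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    # Hypothetical sample data; in the question these would be message["body"] values.
    bodies = ["Please review the attached file.", "The attached file is large."]
    print(parallel_bigram_counts(bodies).most_common(3))
```

The merged Counter holds raw bigram frequencies only. A score such as PMI would still have to be computed from these counts together with unigram counts, which can be gathered the same way.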

How can I scale an NLTK computation across multiple cores?
