How do I optimize word counting in Python?

Time: 2014-04-03 22:06:30

Tags: python-2.7 optimization nlp nltk

I am taking my first steps writing code for linguistic analysis of texts, using Python and the NLTK library. The problem is that the actual word counting pegs my CPU at close to 100% (Core i5, 8 GB RAM, MacBook Air 2014) and had been running for 14 hours before I killed the process. How can I speed up the looping and counting?

I have created a corpus with NLTK from three Swedish, UTF-8 encoded, tab-separated files: Swe_Newspapers.txt, Swe_Blogs.txt, and Swe_Twitter.txt. That works fine:

import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
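
To confirm the reader picked up the files and categories, a quick check like the following can help (the expected output in the comments is my assumption, based on the file names and the cat_pattern above):

print(my_corpus.fileids())     # expected: ['Swe_Blogs.txt', 'Swe_Newspapers.txt', 'Swe_Twitter.txt']
print(my_corpus.categories())  # expected: ['Blogs', 'Newspapers', 'Twitter']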

Then I loaded a text file with one word per line into NLTK. That also works fine.

my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")
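
As a quick sanity check (my own illustration, not part of the original setup), the loaded word list can be inspected before counting:

wordlist_words = my_wordlist.words()   # one entry per line of wordlist.txt
print(len(wordlist_words))             # should be around 20,000, per the description below
print(wordlist_words[:10])             # peek at the first few entries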

The text file I want to analyze (Swe_Blogs.txt) has this structure, and it parses correctly:

Wordpress.com   2010/12/08  3   1,4,11  osv osv osv …
bloggagratis.se 2010/02/02  3   0   Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com   2010/03/10  3   0   1 kruka Sallad, riven
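
As an illustration only (not from the original post), splitting one of these lines on tabs shows that the running text sits in the fifth field, which is also the field the UNIX pipeline in the second answer extracts:

line = "wordpress.com\t2010/03/10\t3\t0\t1 kruka Sallad, riven"
fields = line.split("\t")
print(fields[4])  # -> "1 kruka Sallad, riven"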

EDIT: The suggestion below for building the counter did not work as written, but it could be fixed:

counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)

This produced the error:

IOError                                   Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182     def words(self, fileids=None, categories=None):
183         return PlaintextCorpusReader.words(
--> 184             self, self._resolve(fileids, categories))
185     def sents(self, fileids=None, categories=None):
186         return PlaintextCorpusReader.sents(

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
 89                                            encoding=enc)
 90                            for (path, enc, fileid)
 ---> 91                            in self.abspaths(fileids, True, True)])
 92 
 93 
 /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165             fileids = [fileids]
166 
--> 167         paths = [self._root.join(f) for f in fileids]
168 
169         if include_encoding and include_fileid:  

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174     def join(self, fileid):
175         path = os.path.join(self._path, *fileid.split('/'))
--> 176         return FileSystemPathPointer(path)
177 
178     def __repr__(self):

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152         path = os.path.abspath(path)
153         if not os.path.exists(path):
--> 154             raise IOError('No such file or directory: %r' % path)
155         self._path = path

IOError: No such file or directory: '/Users/mos/Documents/Blogs'

The fix was to assign my_corpus.words(categories=["Blogs"]) to a variable:

blogs_text = my_corpus.words(categories=["Blogs"])

When I try to count all occurrences of each word in the word list (about 20K words) within the blogs in the corpus (115.7 MB), my computer struggles. How can I speed up the code below? It seems to work, with no error messages, but it takes more than 14 hours to run.

import collections
counter = collections.Counter()

for word in my_corpus.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token]+=1
        else:
            continue

Any help that improves my coding skills is greatly appreciated!

2 Answers:

Answer 0: (score: 3)

It looks like your double loop can be improved:

for word in mycorp.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token]+=1

This will be much faster:

words = set(my_wordlist.words()) # call once, make set for fast check
for word in mycorp.words(categories="Blogs"):
    if word in words:
        counter[word] += 1

This takes you from performing len(my_wordlist.words()) * len(mycorp.words(...)) operations down to roughly len(my_wordlist.words()) + len(mycorp.words(...)) operations, because building the set is O(n) and checking whether a word is in the set is O(1) on average.

You can also build the Counter directly from the iterable, as Two-Bit Alchemist points out:

counter = Counter(word for word in mycorp.words(categories="Blogs") 
                  if word in words)
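
For reference, a minimal end-to-end sketch combining the two points above (set membership plus building the Counter from a generator), reusing the reader definitions from the question, might look like this:

import collections
import nltk

# Readers as defined in the question.
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(
    ".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")

# Build the word list as a set once: membership tests are O(1) on average.
words = set(my_wordlist.words())

# Single pass over the blog texts, counting only words that are in the set.
counter = collections.Counter(
    word for word in my_corpus.words(categories=["Blogs"]) if word in words)

print(counter.most_common(20))  # the 20 most frequent word-list words in the blogs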

Answer 1: (score: 3)

You already have good answers on how to count words properly in Python. The problem is that it will still be quite slow. If you are just exploring the corpus, a chain of UNIX tools gets you results much faster. Assuming your text is tokenized, something like this gives you the top 100 tokens in descending order (cut uses the tab character as its default field delimiter, so -f 5 selects the text column):

cat Swe_Blogs.txt | cut -f 5 | tr ' ' '\n' | sort | uniq -c | sort -nr | head -n 100
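
For comparison, roughly the same top-100 count can be done in plain Python without NLTK. This is only a sketch, under the assumptions that the file is UTF-8, tab-separated as shown above, and that whitespace splitting is an acceptable stand-in for tokenization:

from __future__ import print_function
import codecs
import collections

counter = collections.Counter()
with codecs.open("Swe_Blogs.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5:
            # whitespace tokenization of the text field, like `tr ' ' '\n'` above
            counter.update(fields[4].split())

for token, count in counter.most_common(100):
    print(count, token)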