Efficiently count word frequencies in Python

Date: 2016-03-08 01:52:22

Tags: python nlp scikit-learn word-count frequency-distribution

I want to count the frequencies of all the words in a text file.

>>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file looks like this:

# test.txt
aaa bbb ccc
bbb

I implemented it in pure Python following some posts. However, I found that the pure-Python approach is insufficient because of the file size (> 1 GB).

I think borrowing the power of sklearn is a candidate.

If I let CountVectorizer count frequencies for each line, I guess I would get word frequencies by summing up each column. But that sounds like a rather indirect way to do it.

What is the most efficient and straightforward way to count words in a file with Python?

Update

My (very slow) code is here:

import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)  # Python 2 str.translate(table, deletechars)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return { k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y) }

8 Answers:

Answer 0 (Score: 37)

The most succinct approach is to use the tools Python gives you.

from future_builtins import map  # Only on Python 2

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

That's it. map(str.split, f) makes a generator that returns a list of words from each line. Wrapping it in chain.from_iterable converts that into a single generator that produces one word at a time. Counter takes an iterable input and counts all the unique values in it. At the end, you return a dict-like object (a Counter) that stores all the unique words and their counts, and during creation you only ever hold one line of data and the running totals in memory, not the whole file at once.

In theory, on Python 2.7 and 3.1, you could do slightly better by looping over the chained results yourself and counting with a dict or collections.defaultdict(int) (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole point is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher, Counter has a C-level accelerator for counting iterable inputs that runs faster than anything you could write in pure Python.
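For reference, a minimal sketch of that manual loop (my own illustration, not part of the answer's code) might look like this:

from collections import defaultdict
from itertools import chain

def countInFile_manual(filename):
    # Count words one at a time with a plain defaultdict instead of a Counter
    counts = defaultdict(int)
    with open(filename) as f:
        for word in chain.from_iterable(map(str.split, f)):
            counts[word] += 1
    return counts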

Update: You seem to want punctuation stripped and case-insensitive matching, so here is a variant of my earlier code that does that:

from collections import Counter
from itertools import chain
from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))

Your code runs much more slowly because it creates and destroys many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would at least be algorithmically similar in scaling factor).
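For illustration, the "update a single Counter once per line" approach described above could be sketched roughly like this (my own sketch, without the punctuation handling of the updated block):

from collections import Counter

def countInFile_update(filename):
    # One long-lived Counter, updated once per line, instead of merging fresh dicts
    counts = Counter()
    with open(filename) as f:
        for line in f:
            counts.update(line.split())
    return counts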

Answer 1 (Score: 9)

A memory-efficient and accurate way to do this is to use

  • CountVectorizer in scikit (for ngram extraction)
  • NLTK for word_tokenize
  • numpy matrix summation to collect the counts
  • collections.Counter for collecting the counts and the vocabulary

An example:

import urllib.request
from collections import Counter

import numpy as np 

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')


# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))

[OUT]:

[(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also just do this:

from collections import Counter
import numpy as np 
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's time it:

import time

start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)

[OUT]:

5.257147789001465

Note that CountVectorizer can also take a file instead of a string, and in that case there is no need to read the whole file into memory. In code:

import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    vocab = ngram_vectorizer.get_feature_names()
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print (freq_distribution.most_common(10))

Answer 2 (Score: 3)

Here's a benchmark. It will look strange, but the crudest code wins.

[CODE]:

from collections import Counter, defaultdict
import io, time

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print time.time() - start

start = time.time()
extract_dictionary_native(infile)
print time.time() - start

start = time.time()
extract_dictionary_paddle(infile)
print time.time() - start

[OUT]:

38.306814909
24.8241138458
12.1182529926

Size of the data used in the benchmark above (154 MB):

$ wc -c /path/to/file
161680851

$ wc -l /path/to/file
2176141

A few things to note:

  • With the sklearn version, there is the overhead of creating the vectorizer, plus the numpy operations and the conversion into a Counter object
  • Then comes the native Counter update version; it seems Counter.update() is an expensive operation (a micro-benchmark sketch follows this list)
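If you want to check the Counter.update() overhead yourself, a rough micro-benchmark along these lines could be used (my own sketch with made-up data, not part of the original benchmark):

import timeit

setup = '''
from collections import Counter, defaultdict
lines = ['aaa bbb ccc', 'bbb ddd eee'] * 100000
'''

counter_version = '''
c = Counter()
for line in lines:
    c.update(line.split())
'''

defaultdict_version = '''
d = defaultdict(int)
for line in lines:
    for word in line.split():
        d[word] += 1
'''

print(timeit.timeit(counter_version, setup=setup, number=3))
print(timeit.timeit(defaultdict_version, setup=setup, number=3))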

Answer 3 (Score: 2)

This should be sufficient.

def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d

Answer 4 (Score: 0)

Skip CountVectorizer and scikit-learn.

The file may be too large to load into memory, but I doubt the Python dictionary would get too big. The easiest option is probably to split the large file into 10-20 smaller files and extend your code to loop over the smaller files, as sketched below.
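A rough sketch of that idea, assuming the big file has already been split, e.g. with split -l 200000 big.txt chunk_ on a Unix system (the chunk file names here are made up for illustration):

import glob
from collections import Counter

def count_over_chunks(pattern):
    # Merge the counts from each smaller file into one running total
    total = Counter()
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                total.update(line.lower().split())
    return total

# e.g. counts = count_over_chunks('chunk_*')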

Answer 5 (Score: 0)

Instead of decoding the whole bytes object read from the url, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing the punctuation, I utf-8 decode the byte string.

The function freq_dist expects an iterable. That's why I pass data.splitlines().

from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'

data = urlopen(url).read()

def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    #For readability   
    #return Counter(word for line in data
    #    for word in line.translate(
    #    None,bytes(punctuation.encode('utf-8'))).decode('utf-8').split())

    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)


start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))

Output:

elapsed: 0.806480884552

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

It seems a plain dict is more efficient than a Counter object.

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d

start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])

Output:

elapsed: 0.642680168152

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

To be more memory efficient when opening such a huge file, you can pass just the opened url. But then the timing will also include the file download time.

data = urlopen(url)
word_dist = freq_dist(data)

Answer 6 (Score: 0)

You can try with sklearn.

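A minimal sketch of the idea (my own illustration, using the same older get_feature_names API as the answers above, not the original poster's code):

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
with open('test.txt') as fin:
    X = vectorizer.fit_transform(fin)       # one document per line
vocab = vectorizer.get_feature_names()      # older sklearn API
counts = X.sum(axis=0).A1                   # column-wise sums give per-word totals
print(Counter(dict(zip(vocab, counts))).most_common(10))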

Answer 7 (Score: 0)

Combining everyone else's views and some of my own :) Here is what I have for you:

from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords

text='''Note that if you use RegexpTokenizer option, you lose 
natural language features special to word_tokenize 
like splitting apart contractions. You can naively 
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common

Output

(all of them)

[('note', 1),
 ('use', 1),
 ('regexptokenizer', 1),
 ('option', 1),
 ('lose', 1),
 ('natural', 1),
 ('language', 1),
 ('features', 1),
 ('special', 1),
 ('word', 1),
 ('tokenize', 1),
 ('like', 1),
 ('splitting', 1),
 ('apart', 1),
 ('contractions', 1),
 ('naively', 1),
 ('split', 1),
 ('regex', 1),
 ('without', 1),
 ('need', 1)]

In terms of efficiency, you can do better than this, but if you are not too worried about that, this code is as good as any.