The training part of my code can handle a smaller number of comments, but since my whole dataset contains ~500,000 comments, I would like to train it on more data. I seem to run out of memory at 100,000 comments when running the trainer.



import random
import nltk

data = get_data(limit=size)
data = clean_data(data)

all_words = [w.lower() for (comment, category) in data for w in comment]
word_features = []
for i in nltk.FreqDist(all_words).most_common(3000):
    word_features.append(i[0])

random.shuffle(data)

def get_features(comment):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))  # error here
    return features

# I can do it myself like this:
feature_set = [(get_features(comment), category) for (comment, category) in data]

# Or use nltk's Lazy Map implementation which arguably does the same thing:
# feature_set = nltk.classify.apply_features(get_features, data, labeled=True)

Running this on 100,000 comments eats all of my 32 GB of RAM and eventually crashes with a Memory Error.


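Edit: I have greatly reduced the number of features: I now use only the 3,000 most common words as features, which significantly improved performance (for obvious reasons). I also corrected a small error pointed out by @Marat.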

1 Answer:

Answer 0 (score: 1)



# defined with one parameter
def get_features(comment):

# called with two
... get_features(comment, word_features), ...


# set(comment) executed on every iteration
for word in word_features:
    features[word] = (word in set(comment))

# can be transformed into something like:
word_set = set(comment)
for word in word_features:
    features[word] = word in word_set

# if typical comment length is < 30, list lookup is faster
for word in word_features:
    features[word] = word in comment
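
To check that hoisting `set(comment)` out of the loop changes only the cost and not the result, here is a minimal sketch. The vocabulary and the comment are made-up stand-ins for the question's `word_features` and one tokenized comment, and the timings will vary by machine.

```python
import timeit

# Hypothetical stand-ins for the question's data (not from the original post)
word_features = [f"w{i}" for i in range(3000)]
comment = [f"w{i}" for i in range(0, 200, 2)]  # one comment of 100 tokens

def features_rebuild_set(comment):
    # Original pattern: set(comment) is rebuilt once per feature word
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))
    return features

def features_hoisted_set(comment):
    # Set is built once per comment, then reused for every lookup
    word_set = set(comment)
    features = {}
    for word in word_features:
        features[word] = word in word_set
    return features

# Both variants must produce identical feature dicts
assert features_rebuild_set(comment) == features_hoisted_set(comment)

for fn in (features_rebuild_set, features_hoisted_set):
    seconds = timeit.timeit(lambda: fn(comment), number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 calls")
```

This only affects speed. Both variants still store 3,000 entries per comment, so it does not address the memory problem by itself.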


# it is cheaper to set few positives than to check all word_features
# also MUCH more memory efficient
from collections import defaultdict
def get_features(comment):
    features = defaultdict(bool)
    for word in comment:
        features[word] = True
    return features
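
As a rough illustration of the memory difference, the sketch below compares a dense dict (one entry per feature word) with a sparse dict that stores only the words actually present. The data is hypothetical. Unlike the `defaultdict` version above, this variant also drops words outside the 3,000-word vocabulary and returns a plain dict, since looking up a missing key in a `defaultdict` inserts it. Whether your classifier treats an absent key the same as `False` is an assumption worth checking.

```python
import sys

# Hypothetical stand-ins for the question's data (not from the original post)
word_features = [f"w{i}" for i in range(3000)]
vocabulary = set(word_features)
comment = ["w1", "w5", "w5", "unknown", "w42"]

def dense_features(comment):
    # One entry per vocabulary word, True or False
    word_set = set(comment)
    return {word: (word in word_set) for word in word_features}

def sparse_features(comment):
    # Only words that occur in the comment and belong to the vocabulary
    return {word: True for word in comment if word in vocabulary}

dense = dense_features(comment)
sparse = sparse_features(comment)

# The positive features are the same in both representations
assert {w for w, present in dense.items() if present} == set(sparse)

# getsizeof reports the dict container only, not the shared key strings
print("dense :", len(dense), "entries,", sys.getsizeof(dense), "bytes")
print("sparse:", len(sparse), "entries,", sys.getsizeof(sparse), "bytes")
```

Multiplying the dense per-dict size by 100,000 comments gives a rough idea of why the list of dense dicts does not fit in RAM.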


# numpy array is much more efficient than a list of dicts
# .. and with pandas on top it's even nicer:
import pandas as pd
feature_set = pd.DataFrame(
    ({word: True for word in comment}
      for (comment, _) in data),
    columns=word_features
)  # words absent from a comment show up as NaN rather than False
feature_set['category'] = [category for (_, category) in data]
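
For the plain numpy route, a minimal sketch of a boolean feature matrix follows. The tiny `word_features` and `data` here are invented for illustration. With `dtype=bool` each cell takes one byte, so 100,000 comments by 3,000 features comes to about 300 MB. This assumes a classifier that accepts arrays rather than a list of dicts.

```python
import numpy as np

# Hypothetical stand-ins for the question's data (not from the original post)
word_features = ["good", "bad", "movie", "plot"]
data = [
    (["good", "movie"], "pos"),
    (["bad", "plot", "bad"], "neg"),
    (["unseen", "movie"], "pos"),
]

# Map each feature word to its column index
word_index = {word: j for j, word in enumerate(word_features)}

# One row per comment, one column per feature word, one byte per cell
matrix = np.zeros((len(data), len(word_features)), dtype=bool)
for i, (comment, _) in enumerate(data):
    for word in comment:
        j = word_index.get(word)
        if j is not None:  # skip words outside the vocabulary
            matrix[i, j] = True

categories = np.array([category for (_, category) in data])

print(matrix)
print("bytes used:", matrix.nbytes)
```

Because the loop only touches words that occur in each comment, building the matrix costs time proportional to the total number of tokens, not to comments times features.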