The training portion of my code can handle data on the order of 10^4 comments, but given that my whole dataset consists of ~500,000 comments, I would like to train it with much more data. Running the trainer, I seem to run out of memory at around 100,000 comments. My get_features function seems to be the culprit:
import nltk
import random

data = get_data(limit=size)
data = clean_data(data)
all_words = [w.lower() for (comment, category) in data for w in comment]
word_features = []
for i in nltk.FreqDist(all_words).most_common(3000):
    word_features.append(i[0])
random.shuffle(data)
def get_features(comment):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))  # error here
    return features
# I can do it myself like this:
feature_set = [(get_features(comment), category) for
               (comment, category) in data]
# Or use nltk's LazyMap implementation, which arguably does the same thing:
# feature_set = nltk.classify.apply_features(get_features, data, labeled=True)
Running this over the comments takes up all 32GB of my RAM and eventually crashes with a MemoryError at around 100,000 rows.
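A rough back-of-the-envelope check suggests why (my own estimate, assuming CPython's usual dict overhead): each get_features result is a dict with 3000 entries, which sys.getsizeof puts at roughly 150 KB before even counting the keys, so 100,000 of them works out to well over 10 GB:

import sys

# measure one feature dict and extrapolate; data and get_features as above
per_comment = sys.getsizeof(get_features(data[0][0]))
print(per_comment)                    # ~150 KB for a 3000-key dict
print(per_comment * 100_000 / 2**30)  # extrapolated total, in GiB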
What can I do to mitigate this problem?
EDIT: I have reduced the number of features considerably: I now use only the 3000 most common words as features, which improved performance noticeably (for obvious reasons). I also corrected a small error pointed out by @Marat.
Answer (score: 1):
Disclaimer: there are many potential flaws in this code, so I expect it will take a few iterations to get to the root cause.
# defined with one parameter
def get_features(comment):
    ...

# called with two
... get_features(comment, word_features), ...
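One possible fix, sketched here only for illustration (the asker's actual correction may look different): make the dependency explicit in the signature and bind it once with functools.partial, so the definition and the call sites agree:

from functools import partial

def get_features(comment, word_features):
    ...

# bind word_features once; the resulting callable takes just a comment
features_fn = partial(get_features, word_features=word_features)
feature_set = nltk.classify.apply_features(features_fn, data, labeled=True)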
# set(comment) is executed on every iteration
for word in word_features:
    features[word] = (word in set(comment))

# it can be transformed into something like:
word_set = set(comment)
for word in word_features:
    features[word] = word in word_set
# if the typical comment length is < 30, a list lookup is faster
for word in word_features:
    features[word] = word in comment
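Whether the list beats the set depends on the data, so it is worth measuring. A minimal timeit sketch (synthetic stand-ins for word_features and a typical comment, not the asker's data) compares the three variants:

import timeit

setup = '''
word_features = ['w%d' % i for i in range(3000)]
comment = ['w%d' % i for i in range(25)]  # a typical short comment
comment_set = set(comment)
'''

# set rebuilt on every membership test, as in the original code
print(timeit.timeit('[w in set(comment) for w in word_features]', setup, number=100))
# set built once, O(1) lookups afterwards
print(timeit.timeit('[w in comment_set for w in word_features]', setup, number=100))
# plain list scan; competitive only while comments stay short
print(timeit.timeit('[w in comment for w in word_features]', setup, number=100))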
# it is cheaper to set the few positives than to check all word_features
# (it is also MUCH more memory efficient)
from collections import defaultdict
...

def get_features(comment):
    features = defaultdict(bool)
    for word in comment:
        features[word] = True
    return features
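A note on this design: only the words actually present in the comment are stored, yet looking up an absent word still returns False, because defaultdict(bool) calls bool() for missing keys. One caveat: that lookup also inserts the missing key, so probing every word in word_features would grow the dict right back:

from collections import defaultdict

features = defaultdict(bool)
features['great'] = True
print(features['great'])  # True
print(features['awful'])  # False, but the access inserts 'awful' as a side effect
print(len(features))      # 2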
# a numpy array is much more efficient than a list of dicts,
# and with pandas on top it's even nicer:
import pandas as pd
...

feature_set = pd.DataFrame(
    ({word: True for word in comment}
     for (comment, _) in data),
    columns=word_features
).fillna(False)
feature_set['category'] = [category for (_, category) in data]
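To close the loop, here is one way the resulting frame could feed a classifier; a minimal sketch assuming scikit-learn is available (a substitution on my part, since the question uses NLTK):

from sklearn.naive_bayes import BernoulliNB

X = feature_set[word_features]  # boolean word-presence matrix
y = feature_set['category']
model = BernoulliNB().fit(X, y)
print(model.score(X, y))        # training accuracy, as a sanity check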