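Hello, I am new to Python, scikit-learn and ML. I am getting a memory error when using partial_fit with MultinomialNB. I am trying to do multi-label classification on the DMOZ directory data.
My question:
My approach:
Store the DMOZ directory dump in MongoDB / TokuMX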
    {
        "_id": {
            "$oid": "54e758c91d41c804d8ace196"
        },
        "docs": [
            {
                "url": "http://www.awn.com/",
                "description": "Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.",
                "title": "Animation World Network"
            }
        ],
        "labels": [
            "Top",
            "Arts",
            "Animation"
        ]
    }
Iterate over the docs array and pass each docs element to my classifier function.
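Vectorizer and classifier
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    classifier = MultinomialNB()
    vectorizer = HashingVectorizer(
        stop_words='english',
        strip_accents='unicode',
        norm='l2'
    )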
My classifier function
    import lxml.html
    import nltk
    import requests

    def classify(doc, labels, classifier, vectorizer, *args):
        r = requests.get(doc['url'], verify=False)
        print("Retrieving URL = {0}\n".format(doc['url']))
        if r.status_code == 200:
            html = lxml.html.fromstring(r.text)
            doc['content'] = []
            tags = ['font', 'td', 'h1', 'h2', 'h3', 'p', 'title']
            for tag in tags:
                for x in html.xpath('//' + tag):
                    try:
                        # Keep only the nouns (POS tags starting with 'NN')
                        bag_of_words = nltk.word_tokenize(x.text_content())
                        pos_tagged = nltk.pos_tag(bag_of_words)
                        for word, pos in pos_tagged:
                            if pos[:2] == 'NN':
                                doc['content'].append(word)
                    except AttributeError as e:
                        print(e)
            x_train = vectorizer.fit_transform(doc['content'])
            # On the first call to partial_fit, pass the full list of classes
            if len(args) == 1:
                classifier.partial_fit(x_train, labels, classes=args[0])
            else:
                classifier.partial_fit(x_train, labels)
            return doc
X: doc['content'] is an array of nouns. (600 elements)
Y: labels is an array of the labels shown in the mongo document above. (3 elements)
Classes: args[0] is an array of all (unique) labels. (17490 elements)
I am running this in VirtualBox on a quad-core laptop, with 4 GB of RAM allocated to the VM.
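Answer 0 (score: 0)
What are the 17490 unique labels? There is one coefficient for each label and each feature, which is probably where your memory error comes from.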
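To put a rough number on that, here is a minimal back-of-the-envelope sketch. It rests on two assumptions that are not stated in the question: that HashingVectorizer is left at its default n_features of 2**20, and that the classifier keeps at least one dense float64 array of shape (n_classes, n_features). The helper name dense_array_gib is made up for this illustration and is not part of scikit-learn.

```python
# Rough size of one dense (n_classes, n_features) float64 array.
# Assumptions (not from the question): HashingVectorizer uses its default
# n_features = 2**20, and the classifier stores at least one dense array
# with one float64 coefficient per (label, feature) pair.

BYTES_PER_FLOAT64 = 8


def dense_array_gib(n_classes, n_features, bytes_per_value=BYTES_PER_FLOAT64):
    """Return the size in GiB of a dense n_classes x n_features array."""
    return n_classes * n_features * bytes_per_value / float(2 ** 30)


if __name__ == "__main__":
    n_classes = 17490  # number of unique labels given in the question
    for exponent in (20, 18, 16):
        n_features = 2 ** exponent
        print("n_features=2**{0}: {1:.1f} GiB".format(
            exponent, dense_array_gib(n_classes, n_features)))
```

Under those assumptions a single such array already needs far more than the 4 GB given to the VM. Passing a smaller n_features to HashingVectorizer, or training on fewer distinct labels, shrinks it proportionally, at the cost of more hash collisions or a coarser label set.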