Question

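Hi, I'm new to Python, scikit-learn, and ML. I'm getting a MemoryError when calling MultinomialNB.partial_fit while trying to do multi-label classification on the DMOZ directory data.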

My questions:

What am I doing wrong? Am I short of memory, or is my data wrong?
Am I using the right approach?
What can I do to improve my approach?

Approach:

Store the DMOZ directory database in MongoDB / TokuMX

{
  "_id": {
    "$oid": "54e758c91d41c804d8ace196"
  },
  "docs": [
    {
      "url": "http://www.awn.com/",
      "description": "Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.",
      "title": "Animation World Network"
    }
  ],
  "labels": [
    "Top",
    "Arts",
    "Animation"
  ]
}

Iterate over the docs array and pass each docs element to my classifier function.
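A minimal sketch of what that loop might look like, assuming records shaped like the document above. The helper names `unique_labels` and `iter_classify_args` and the second sample record are hypothetical and not part of the original code. The full class list is attached to the first call only.

```python
def unique_labels(records):
    """Collect every distinct label across all records (hypothetical helper)."""
    return sorted({label for record in records for label in record["labels"]})


def iter_classify_args(records, classes):
    """Yield (doc, labels, extra_args) for every element of every docs array.

    extra_args holds the full class list on the first call only.
    """
    first_call = True
    for record in records:
        for doc in record["docs"]:
            extra = (classes,) if first_call else ()
            first_call = False
            yield doc, record["labels"], extra


# Illustrative records; the second one is made up for the example.
records = [
    {"docs": [{"url": "http://www.awn.com/"}], "labels": ["Top", "Arts", "Animation"]},
    {"docs": [{"url": "http://example.com/"}], "labels": ["Top", "Arts"]},
]

classes = unique_labels(records)
calls = list(iter_classify_args(records, classes))
```

Each tuple can then be unpacked into a call such as `classify(doc, labels, classifier, vectorizer, *extra)`.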

Vectorizer and classifier

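    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    classifier = MultinomialNB()
    vectorizer = HashingVectorizer(
        stop_words='english',
        strip_accents='unicode',
        norm='l2'
    )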

My classifier function

import lxml.html
import nltk
import requests


def classify(doc, labels, classifier, vectorizer, *args):
    # Fetch the page (TLS certificate verification is disabled)
    r = requests.get(doc['url'], verify=False)

    print("Retrieving URL = {0}\n".format(doc['url']))

    if r.status_code == 200:
        html = lxml.html.fromstring(r.text)
        doc['content'] = []

        # Collect the nouns found in the text of these tags
        tags = ['font', 'td', 'h1', 'h2', 'h3', 'p', 'title']
        for tag in tags:
            for x in html.xpath('//'+tag):
                try:
                    bag_of_words = nltk.word_tokenize(x.text_content())
                    pos_tagged = nltk.pos_tag(bag_of_words)

                    for word, pos in pos_tagged:
                        # Penn Treebank noun tags all start with 'NN'
                        if pos[:2] == 'NN':
                            doc['content'].append(word)

                except AttributeError as e:
                    print(e)

        x_train = vectorizer.fit_transform(doc['content'])

        # if we are the first one to run partial_fit, pass all classes
        if len(args) == 1:
            classifier.partial_fit(x_train, labels, classes=args[0])
        else:
            classifier.partial_fit(x_train, labels)

        return doc

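X: doc['content'] is an array of nouns. (600)

Y: labels is an array containing the labels from the Mongo document shown above. (3)

Classes: args[0] is an array of all the (unique) labels. (17490)

Running in VirtualBox on a quad-core laptop, with 4 GB of RAM allocated to the VM.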

Answer 1

What are the 17490 unique labels? There is a coefficient for each label and each feature, which is probably where your memory error comes from.
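To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the coefficients are held in a dense array of shape (n_classes, n_features) with 8-byte floats, and that the vectorizer uses HashingVectorizer's default of 2**20 features, which the code in the question does not override. The function name `dense_param_bytes` is made up for the illustration.

```python
def dense_param_bytes(n_classes, n_features, bytes_per_value=8):
    """Bytes needed for one dense (n_classes, n_features) array of floats."""
    return n_classes * n_features * bytes_per_value


n_classes = 17490        # unique labels reported in the question
n_features = 2 ** 20     # HashingVectorizer default (assumed not overridden)

total = dense_param_bytes(n_classes, n_features)
gib = total / float(2 ** 30)

print("{0} bytes, about {1:.1f} GiB per array".format(total, gib))
```

Under these assumptions a single such array needs well over 100 GiB, far beyond the 4 GB given to the VM. The same function can be rerun with a smaller `n_features` or fewer classes to see how the footprint scales.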

Python: Memory Error - MultinomialNB.partial_fit() - 17k classes
