Question

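There is a similar question here, Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words', but it did not receive any useful answers.

I am trying to train Doc2Vec on the 20newsgroups corpus. Here is how I build the vocabulary: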

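import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.datasets import fetch_20newsgroups

def get_data(subset):
    newsgroups_data = fetch_20newsgroups(subset=subset, remove=('headers', 'footers', 'quotes'))
    docs = []
    for news_no, news in enumerate(newsgroups_data.data):
        # Whitespace tokenization; skip posts that are empty after stripping
        tokens = gensim.utils.to_unicode(news).split()
        if len(tokens) == 0:
            continue
        sentiment = newsgroups_data.target[news_no]
        # Two tags per document: a per-document ID and the newsgroup label.
        # Note: news_no restarts at 0 for each subset, so train and test IDs overlap.
        tags = ['SENT_' + str(news_no), str(sentiment)]
        docs.append(TaggedDocument(tokens, tags))
    return docs

train_docs = get_data('train')
test_docs = get_data('test')
alldocs = train_docs + test_docs

# dm, size, window, alpha, negative, sample, min_count, cores and passes are defined elsewhere
model = Doc2Vec(dm=dm, size=size, window=window, alpha=alpha, negative=negative, sample=sample, min_count=min_count, workers=cores, iter=passes)
model.build_vocab(alldocs)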

Then I train the model and save the result:

model.train(train_docs, total_examples = len(train_docs), epochs = model.iter)
model.train_words = False
model.train_labels = True
model.train(test_docs, total_examples = len(test_docs), epochs = model.iter)

model.save(output)

The problem appears when I try to load the model: screen

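I have tried:

- using LabeledSentence instead of TaggedDocument
- yielding TaggedDocument objects instead of appending them to a list
- setting min_count to 1 so that no words are ignored (just in case)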

The problem occurs on both Python 2 and Python 3.

Please help me solve this.

Answer 1

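You have hidden the most important information behind the offsite (imgur) "screen" link: the exact code that triggers the error, and the error text itself. (That would be the ideal text to cut and paste into the question, rather than the other steps, which seem to run fine without triggering the error.)

Looking at that screenshot, there is the line: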

model = Doc2Vec("20ng_infer")

...which triggers the error.

Note that none of the arguments documented for the Doc2Vec() initialization method is a plain string like the "20ng_infer" argument in the line above, so that call is unlikely to do anything useful.
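To illustrate why a plain string leads to this particular message: as far as I can tell, the first positional parameter of Doc2Vec() is the training corpus, an iterable of TaggedDocument-like objects. A string is itself iterable, one character at a time, and a character has no .words attribute. The sketch below reproduces that failure with the standard library only. The namedtuple is a stand-in for gensim's TaggedDocument and count_words is a hypothetical helper, not gensim's actual code.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: only .words and .tags matter here)
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

def count_words(documents):
    """Hypothetical helper: walk a corpus the way a vocabulary scan might."""
    total = 0
    for document in documents:
        # Each item is expected to expose a .words list
        total += len(document.words)
    return total

docs = [TaggedDocument(["hello", "world"], ["SENT_0"])]
print(count_words(docs))  # 2

try:
    # Iterating a string yields single characters, which have no .words
    count_words("20ng_infer")
except AttributeError as err:
    print(err)  # 'str' object has no attribute 'words'
```

Passing a file name where a corpus is expected therefore fails on the very first "document", which is the character "2".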

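If you are trying to load a model that was previously saved with model.save(), you should use Doc2Vec.load(), which takes a string giving the local file path to load the model from. So try: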

model = Doc2Vec.load("20ng_infer")

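(Also note that larger models may be saved to multiple files, all starting with the string you supplied to save(), and these files must be kept and moved together in order to load() them again later.)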
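If you do need to relocate a saved model, one way to keep the files together is to move everything that shares the save() prefix. This is a minimal sketch using only the standard library. move_saved_model is a hypothetical helper, and it assumes no unrelated files in the same directory start with the same prefix.

```python
import shutil
from pathlib import Path

def move_saved_model(prefix, dest_dir):
    """Move every file whose name starts with the save() prefix into dest_dir."""
    src = Path(prefix)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    # Look in the directory that holds the main model file
    for path in sorted(src.parent.iterdir()):
        if path.is_file() and path.name.startswith(src.name):
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved
```

After a call such as move_saved_model("20ng_infer", "models"), the model would be loaded from its new location by passing the new path of the main file to Doc2Vec.load().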

