Question

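I am trying to use gensim's (ver 1.0.1) doc2vec to get the cosine similarity of documents. This should be relatively simple, but I am having trouble retrieving the document vectors so that I can compute the cosine similarity. When I try to retrieve a document by the label I gave it in training, I get a key error.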

For example, trying to retrieve the vector for 4_99.txt from model.docvecs tells me that there is no such key as 4_99.txt.

However, if I run print(model.docvecs.doctags), I see entries such as this: '4_99.txt_3': Doctag(offset=1644, word_count=12, doc_count=1)

So it appears that, for every document, doc2vec is saving each sentence as "document name underscore number".

So I am either A) training incorrectly or B) not understanding how to retrieve the doc vector so that I can do similarity(d1, d2).

Can anyone help me out here?

Here is how I trained my doc2vec:

import os
from gensim.models.doc2vec import Doc2Vec

# Obtain txt abstracts and txt patents
filedir = os.path.abspath(os.path.join(os.path.dirname(__file__)))
files = os.listdir(filedir)

# Doc2Vec takes [['a', 'sentence'], 'and label']
docLabels = [f for f in files if f.endswith('.txt')]

sources = {}  # {'2_139.txt': '2_139.txt'}
for label in docLabels:
    sources[label] = label
sentences = LabeledLineSentence(sources)

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
    model.train(sentences.sentences_perm())

model.save('./a2v.d2v')

which uses this class:

from random import shuffle
from gensim import utils
from gensim.models.doc2vec import LabeledSentence

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources

        flipped = {}

        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    # each line becomes its own document, tagged prefix_<line number>
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        # shuffles in place; to_array() must have been called first
        shuffle(self.sentences)
        return self.sentences

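I got this class from a web tutorial (https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1) to help me get around Doc2Vec's weird data formatting requirements, and to be honest I don't fully understand it. It looks like the class as written here is adding the _n for each sentence, but in the tutorial they still seem to retrieve the document vector by giving just the filename... So what am I doing wrong here?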

Answer 1

The gensim Doc2Vec class uses exactly the document 'tags' you pass it during training as the keys to the doc-vectors.

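And yes, the LabeledLineSentence class is appending _n to the document tags. Specifically, these appear to be the line numbers within the associated files.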

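So, if what you really want is per-line vectors, you have to request the vectors using the same keys that were provided during training, including the _n suffix.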
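A minimal sketch of that lookup follows. It uses a plain dict as a stand-in for model.docvecs so that it runs without gensim; the vector values are invented for illustration, the tag 4_99.txt_4 is assumed to exist alongside the 4_99.txt_3 shown in the question, and cosine_similarity is a hypothetical helper, not a gensim function.

```python
import math

# Stand-in for model.docvecs: the keys are exactly the tags passed in training.
# The vector values are invented purely for illustration.
docvecs = {
    "4_99.txt_3": [0.1, 0.3, -0.2],
    "4_99.txt_4": [0.2, 0.1, -0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The bare filename was never used as a tag, so it is not a key.
try:
    docvecs["4_99.txt"]
    found_bare_name = True
except KeyError:
    found_bare_name = False

# The per-line tags, including the _n suffix, are valid keys.
sim = cosine_similarity(docvecs["4_99.txt_3"], docvecs["4_99.txt_4"])
print(found_bare_name, round(sim, 3))
```

With the trained model itself, the equivalent lookups in the gensim 1.0.x API would be model.docvecs['4_99.txt_3'] for the vector and model.docvecs.similarity('4_99.txt_3', '4_99.txt_4') for the similarity.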

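If you instead want each file to be its own document, you will need to change the corpus class to use the whole file as one document. Looking at the tutorial you reference, it appears they have a second LabeledLineSentence class that is not line-oriented (but is still named that way), and you are not using that variant.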
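One way such a per-file corpus class might look is sketched below. This is not the tutorial's class: the names LabeledFileDocument and TaggedDoc are hypothetical, and TaggedDoc is a namedtuple stand-in with the same words and tags fields as gensim's LabeledSentence, used so the sketch runs without gensim installed.

```python
import os
from collections import namedtuple

# Stand-in with the same fields (words, tags) as gensim's LabeledSentence.
# With gensim installed, LabeledSentence could be used here instead.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


class LabeledFileDocument(object):
    """Yield one tagged document per .txt file, tagged with the bare filename."""

    def __init__(self, directory):
        self.directory = directory
        self.filenames = sorted(
            f for f in os.listdir(directory) if f.endswith(".txt")
        )

    def __iter__(self):
        for name in self.filenames:
            path = os.path.join(self.directory, name)
            with open(path, encoding="utf-8") as fin:
                # Read the whole file so that all of its lines share one tag.
                words = fin.read().split()
            yield TaggedDoc(words, [name])

    def to_array(self):
        return list(self)
```

Because the tag is the filename with no _n suffix, a model trained on this corpus would be keyed by names such as 4_99.txt.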

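Separately, you don't need to loop and call train() multiple times, or manually adjust alpha. In any recent version of gensim, train() already iterates over the corpus multiple times, so looping around it almost certainly isn't doing what you intend. In the latest versions of gensim, calling it this way will even raise an error, because so many outdated examples online encourage this mistake.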

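Just call train() once. It will iterate over your corpus the number of times specified when the model was constructed. (The default is 5, but it can be controlled with the iter initialization parameter. With Doc2Vec corpuses, 10 or more is common.)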
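A minimal sketch of the single-call pattern follows. The helper name train_once is hypothetical, and it assumes the gensim 1.0.x call shape used in the question, where train() takes only the corpus; later gensim releases also expect explicit total_examples and epochs arguments.

```python
def train_once(model, documents):
    """Build the vocabulary, then call train() exactly once.

    Assumes the gensim 1.0.x call shape, where train() takes only the
    corpus and the number of passes comes from the model's iter setting.
    """
    # Materialise the corpus so it can be iterated more than once.
    documents = list(documents)
    model.build_vocab(documents)
    model.train(documents)
    return model
```

With the question's settings, this might be called as train_once(Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8, iter=10), sentences.to_array()), where iter=10 takes the place of the manual ten-pass loop.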

Problems accessing docvectors with gensim
