Doc2vec is inaccurate - how can I improve the model?

Asked: 2017-11-06 08:36:54

Tags: python cosine-similarity doc2vec

I am trying to adapt the Doc2vec tutorial to work on a Pandas DataFrame instead of .txt documents. I want to find the sentence in my data that is most similar to a new input sentence. However, after training, even when I query with a sentence nearly identical to one in the dataset, the top results come back with low similarity scores, and none of them is the sentence I adapted. For example, my training dataset contains the sentence "This is a good cat."; when I then feed in the new sentence "This cat you have is very good." as input, the first sentence is not reported as similar.

The data comes from an Excel sheet and looks roughly like this:

  Description                                                                                                | Group       | Number
0 Sent: This is a sentence                                                                                   | Regular     | NUM1234
1 Sent: Another sentence                                                                                     | Regular     | NUM1243
2 Sent: Basically all the input                                                                              | Other group | NUM1278
3 Sent: Creating a test case to validate the routing between applications.  No action needed at this moment  | Other group | NUM1287
...etc...
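
For anyone reproducing this without the spreadsheet, an equivalent DataFrame can be built inline from the four sample rows above (a minimal hypothetical stand-in for my_data.xls; the real sheet is not shown in the question):

import pandas as pd

# hypothetical stand-in for my_data.xls, built from the sample rows above
df = pd.DataFrame({
    "Description": [
        "Sent: This is a sentence",
        "Sent: Another sentence",
        "Sent: Basically all the input",
        "Sent: Creating a test case to validate the routing between "
        "applications.  No action needed at this moment",
    ],
    "Group": ["Regular", "Regular", "Other group", "Other group"],
    "Number": ["NUM1234", "NUM1243", "NUM1278", "NUM1287"],
})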

I have the following code (some code that is no longer needed after my modifications has been removed):

# imports implied by the snippet below
import datetime
import multiprocessing
from collections import namedtuple, OrderedDict
from contextlib import contextmanager
from timeit import default_timer

import gensim
import pandas as pd
from gensim.models.doc2vec import Doc2Vec

@contextmanager
def elapsed_timer():
    # timing helper, as defined in the gensim IMDB tutorial notebook
    start = default_timer()
    yield lambda: default_timer() - start

df = pd.read_excel("my_data.xls")
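
# removeGeneric() and normalize_text() are not shown in the question; these are
# hypothetical minimal sketches matching their descriptions below, assuming
# NLTK's English stopword list (requires nltk.download("stopwords")):
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def removeGeneric(text):
    # strip the leading "Sent:" prefix, if present
    return text[len("Sent:"):].strip() if text.startswith("Sent:") else text

def normalize_text(text):
    # drop NLTK stopwords and any word shorter than 2 characters
    return " ".join(w for w in text.split()
                    if w not in STOPWORDS and len(w) >= 2)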

df["Description"] = df["Description"].apply(lambda x: removeGeneric(x)) #removeGeneric() just strips "Sent:" from the beginning of each sentence 
for index, row in df.iterrows():
    row["Description"] = row["Description"].lower()
    row["Description"] = normalize_text(row["Description"]) #normalize_text() removes stopwords defined in the nltk package and words shorter than 2 characters

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

alldocs = []  
for index, row in df.iterrows():
    words = gensim.utils.to_unicode(row["Description"]).split()
    tags = [row["Number"]]
    alldocs.append(SentimentDocument(words, tags))

doc_list = alldocs[:]
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

from random import shuffle

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)

    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)

    alpha -= alpha_delta  # decay the learning rate after each pass (alpha_delta was otherwise unused)

for model in simple_models:
    new_sentence = "Test case creation to validation of routing between applications.  No action needed" #Notice how I'm testing with a sentence very similar to one in the original dataset
    new_sentence = removeGeneric(new_sentence)
    new_sentence = normalize_text(new_sentence)
    print(model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)],topn=2))

For this I get the following output:

[('NUM1254', 0.3154909014701843), ('NUM5247', 0.2487245500087738)]
[('NUM3875', 0.20226456224918365), ('NUM3793', 0.1970052272081375)]
[('NUM3585', 0.13086965680122375), ('NUM3857', 0.1298370361328125)]
creating test case validate routing applications action needed moment

All of the suggestions are completely unrelated, with sentences like "site ID factory address good owner electricity request approval region province", and the sentence it should actually be close to ("Creating a test case to validate the routing between applications.  No action needed at this moment", from the dataset) is not in the list.
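
One thing worth sanity-checking here: gensim's infer_vector expects a pre-tokenized list of words, and in the gensim versions of this era a raw string is silently treated as a sequence of single characters. Since normalize_text is not shown, here is a hypothetical debugging sketch against the code above to inspect what the inference step actually receives:

tokens = normalize_text(removeGeneric(
    "Test case creation to validation of routing between applications.  No action needed"))
print(type(tokens), tokens)  # if this is a str, infer_vector would see characters, not words
if isinstance(tokens, str):
    tokens = tokens.split()

for model in simple_models:
    vec = model.infer_vector(tokens)  # infer_vector takes a list of tokens
    print(model.docvecs.most_similar(positive=[vec], topn=2))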

Can you see what I am doing wrong? What can I do to improve the accuracy?

0 Answers