I'm trying to modify the Doc2vec tutorial to take a Pandas dataframe as input instead of .txt documents. I want to find the sentence in my data that is most similar to a new sentence I feed in. However, after training, even when I supply a sentence that is almost identical to one already in the dataset, the top results come back with low similarity scores, and none of them is the sentence I based my query on. For example, the dataset I train Doc2vec on contains the sentence "This is a good cat." I then use the new sentence "This cat you have is very good." as input, and the first sentence is not reported as similar.
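For context, this is the round trip I expect to work (a toy sketch on made-up data, using the same pre-4.0 gensim API as my code below; the tags are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words="this is a good cat".split(), tags=["DOC1"]),
    TaggedDocument(words="the dog sleeps all day".split(), tags=["DOC2"]),
]
model = Doc2Vec(docs, size=50, min_count=1, iter=100)

# infer_vector() takes a list of tokens; most_similar() returns
# (tag, cosine similarity) pairs
vec = model.infer_vector("this cat you have is very good".split())
print(model.docvecs.most_similar([vec], topn=1))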
The data comes from an Excel sheet and looks roughly like this:
   Description                                                                                               | Group       | Number
0  Sent: This is a sentence                                                                                  | Regular     | NUM1234
1  Sent: Another sentence                                                                                    | Regular     | NUM1243
2  Sent: Basically all the input                                                                             | Other group | NUM1278
3  Sent: Creating a test case to validate the routing between applications. No action needed at this moment  | Other group | NUM1287
...etc...
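For reproduction, the relevant part of the dataframe can be mocked up like this (the values are just the placeholders from the table above):

import pandas as pd

df = pd.DataFrame({
    "Description": [
        "Sent: This is a sentence",
        "Sent: Another sentence",
        "Sent: Basically all the input",
        "Sent: Creating a test case to validate the routing between applications. No action needed at this moment",
    ],
    "Group": ["Regular", "Regular", "Other group", "Other group"],
    "Number": ["NUM1234", "NUM1243", "NUM1278", "NUM1287"],
})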
I have the following code (some code that is no longer needed after my modifications has been removed):
import datetime
import multiprocessing
from collections import OrderedDict, namedtuple
from random import shuffle

import gensim
import pandas as pd
from gensim.models import Doc2Vec

df = pd.read_excel("my_data.xls")
df["Description"] = df["Description"].apply(lambda x: removeGeneric(x))  # removeGeneric() just strips "Sent:" from the beginning of each sentence
for index, row in df.iterrows():
    row["Description"] = row["Description"].lower()
    row["Description"] = normalize_text(row["Description"])  # normalize_text() removes stopwords defined in the nltk package and words shorter than 2 characters
SentimentDocument = namedtuple('SentimentDocument', 'words tags')

alldocs = []
for index, row in df.iterrows():
    words = gensim.utils.to_unicode(row["Description"]).split()
    tags = [row["Number"]]
    alldocs.append(SentimentDocument(words, tags))
doc_list = alldocs[:]
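# Note: SentimentDocument mirrors gensim's built-in TaggedDocument, which
# would work the same way here:
#   from gensim.models.doc2vec import TaggedDocument
#   alldocs.append(TaggedDocument(words=words, tags=tags))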
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)
alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes
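# elapsed_timer() is the helper from the gensim IMDB tutorial this code is
# based on; included here so the snippet is self-contained:
from contextlib import contextmanager
from timeit import default_timer

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end - start  # freeze the elapsed time once the block exits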
print("START %s" % datetime.datetime.now())
for epoch in range(passes):
shuffle(doc_list)
for name, train_model in models_by_name.items():
# Train
duration = 'na'
train_model.alpha, train_model.min_alpha = alpha, alpha
with elapsed_timer() as elapsed:
train_model.train(doc_list, total_examples=len(doc_list), epochs=1)
new_sentence = "Test case creation to validation of routing between applications. No action needed"  # notice how I'm testing with a sentence very similar to one in the original dataset
new_sentence = removeGeneric(new_sentence)
new_sentence = normalize_text(new_sentence)
for model in simple_models:
    print(model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=2))
For this I get the following output, as (tag, cosine-similarity) pairs; the last line is the normalized form of the dataset sentence I expect to be matched:
[('NUM1254', 0.3154909014701843), ('NUM5247', 0.2487245500087738)]
[('NUM3875', 0.20226456224918365), ('NUM3793', 0.1970052272081375)]
[('NUM3585', 0.13086965680122375), ('NUM3857', 0.1298370361328125)]
creating test case validate routing applications action needed moment
All of the suggestions are completely unrelated, sentences like "site id factory address good owner power request approval region province", and the sentence it is actually close to ("Creating a test case to validate the routing between applications. No action needed at this moment" from the dataset) is not in the list.
Can you see anything I'm doing wrong? What can I do to improve the accuracy?
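For reference, this is the sanity check I understand should pass (a sketch using the same pre-4.0 gensim calls as above): inferring a vector for a document that is already in the training set should bring back that document's own tag near the top.

# Re-infer a vector for a known training document and look up its
# nearest neighbours among the trained document vectors
sample = alldocs[0]
inferred = simple_models[0].infer_vector(sample.words)  # sample.words is already a token list
print(simple_models[0].docvecs.most_similar([inferred], topn=3))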