What is the modern approach to similarity search?

Time: 2019-05-15 12:54:27

Tags: python nlp similarity

Given is a list of text files. Each text file describes one topic. The input is a psychological concept that I describe in a few sentences.

The text files contain diacritics.

The algorithm should output the file matching the described concept, together with a probability.

My pseudocode:

split the concept by the space literal and put words into an array, while omitting stopwords
iterate over each text file
    split by the space literal and put words into an array, while omitting stopwords
    i = 0
    iterate over vector
        if vectorword in concept
            i++
    determine percentage by using i/vectorcount * 100
    save the percentage in a dictionary filename - percentage
sort dictionary by percentage in descending order
output
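The pseudocode above can be sketched as runnable Python. The stopword list, file contents, and function names below are illustrative assumptions, not part of the question:

```python
# Minimal sketch of the word-overlap approach from the pseudocode.
STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to", "it"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def score_files(concept, files):
    """Return {filename: percentage of the file's words found in the concept}."""
    concept_words = set(tokenize(concept))
    scores = {}
    for name, text in files.items():
        words = tokenize(text)
        hits = sum(1 for w in words if w in concept_words)
        scores[name] = hits / len(words) * 100 if words else 0.0
    # Sort by percentage, descending.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

files = {
    "fear.txt": "fear is a response to perceived danger",
    "joy.txt": "joy is a feeling of great pleasure",
}
print(score_files("a feeling of danger and fear", files))
# → {'fear.txt': 50.0, 'joy.txt': 25.0}
```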

Drawbacks I see with this approach:

  1. The output will not account for similar words, only for the exact words used.
  2. The code is redundant; each text file should only need to be iterated over once, after which a faster structure (e.g. a database) could be used.
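One standard way to address the second drawback is to vectorize all files once up front (e.g. with TF-IDF) and score each query against the precomputed matrix. This is a hedged sketch using scikit-learn; the documents and query are made-up examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {
    "fear.txt": "fear is a response to perceived danger or threat",
    "joy.txt": "joy is a feeling of great pleasure and happiness",
}

# Fit once over all files; stopword removal is built in.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs.values())

def rank(concept):
    """Score the concept against every precomputed document vector."""
    query_vec = vectorizer.transform([concept])
    sims = cosine_similarity(query_vec, doc_matrix)[0]
    return sorted(zip(docs.keys(), sims), key=lambda kv: kv[1], reverse=True)

print(rank("perceived danger and threat"))
```

Each new query then costs only one vectorization and one matrix multiplication, instead of re-tokenizing every file.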

2 answers:

Answer 0 (score: 0)

TL;DR

From https://colab.research.google.com/drive/1wXmqj3LAL6juxvQY_IHTtZAMuN46YZdV

import itertools

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

def cos(a, b):
    # Reshape the 1-D embedding vectors to (1, n) rows for sklearn.
    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]


# Print options: truncate long arrays with an ellipsis
# so they stay human-readable.
np.set_printoptions(precision=4, threshold=10)

# The URL that hosts the DAN model for Universal Sentence Encoder 
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"

embed = hub.Module(module_url)

bulbasaur = """A strange seed was planted on its back at birth. The plant sprouts and grows with this POKéMON."""
ivysaur = """When the bulb on its back grows large, it appears to lose the ability to stand on its hind legs."""
venusaur = """The plant blooms when it is absorbing solar energy. It stays on the move to seek sunlight."""

charmander = """Obviously prefers hot places. When it rains, steam is said to spout from the tip of its tail."""
charmeleon = """When it swings its burning tail, it elevates the temperature to unbearably high levels."""
charizard = """Spits fire that is hot enough to melt boulders. Known to cause forest fires unintentionally."""

input_texts = [bulbasaur, ivysaur, venusaur, 
              charmander, charmeleon, charizard]

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    sentence_embeddings = session.run(embed(input_texts))

names = ['bulbasaur', 'ivysaur  ', 'venusaur', 
         'charmander', 'charmeleon', 'charizard']

for (mon1, vec1), (mon2, vec2) in itertools.product(zip(names, sentence_embeddings), repeat=2):
    print('\t'.join(map(str, [mon1, mon2, cos(vec1, vec2)])))

[Output]:

bulbasaur   bulbasaur   1.0000002
bulbasaur   ivysaur     0.5978951
bulbasaur   venusaur    0.57630616
bulbasaur   charmander  0.27358365
bulbasaur   charmeleon  0.36671823
bulbasaur   charizard   0.3608557
ivysaur     bulbasaur   0.5978951
ivysaur     ivysaur     1.0
ivysaur     venusaur    0.5274135
ivysaur     charmander  0.34133852
ivysaur     charmeleon  0.54503417
ivysaur     charizard   0.26368174
venusaur    bulbasaur   0.57630616
venusaur    ivysaur     0.5274135
venusaur    venusaur    0.99999994
venusaur    charmander  0.37098676
venusaur    charmeleon  0.50332355
venusaur    charizard   0.50058115
charmander  bulbasaur   0.27358365
charmander  ivysaur     0.34133852
charmander  venusaur    0.37098676
charmander  charmander  1.0000001
charmander  charmeleon  0.58522964
charmander  charizard   0.4640133
charmeleon  bulbasaur   0.36671823
charmeleon  ivysaur     0.54503417
charmeleon  venusaur    0.50332355
charmeleon  charmander  0.58522964
charmeleon  charmeleon  1.0000001
charmeleon  charizard   0.59804976
charizard   bulbasaur   0.3608557
charizard   ivysaur     0.26368174
charizard   venusaur    0.50058115
charizard   charmander  0.4640133
charizard   charmeleon  0.59804976
charizard   charizard   1.0000001

For more details, see https://tfhub.dev/google/universal-sentence-encoder/2

Answer 1 (score: 0)

Generally, I would use some variant of word embeddings -> doc2vec, apply it to your text files, and store the resulting vectors. For the psychological-concept input I would do the same, and then search for the most similar stored vector. I am somewhat partial to spaCy for this; https://spacy.io/api/doc and https://spacy.io/usage/vectors-similarity should point you in the right direction. PS: https://stackoverflow.com/help/how-to-ask
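The retrieval step this answer describes (store one vector per document, then find the vector nearest to the query) can be sketched with plain NumPy. The three-dimensional embeddings below are toy stand-ins; in practice they would come from doc2vec, spaCy, or the Universal Sentence Encoder shown in the other answer:

```python
import numpy as np

# Toy stand-in embeddings, one row per stored document.
doc_vectors = np.array([
    [0.9, 0.1, 0.0],   # fear.txt
    [0.1, 0.8, 0.3],   # joy.txt
], dtype=float)
names = ["fear.txt", "joy.txt"]

def most_similar(query_vec, doc_vectors, names):
    """Rank stored documents by cosine similarity to the query vector."""
    # Normalize rows so a dot product equals cosine similarity.
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = docs @ q
    order = np.argsort(sims)[::-1]
    return [(names[i], float(sims[i])) for i in order]

print(most_similar(np.array([1.0, 0.2, 0.0]), doc_vectors, names))
```

Because the document matrix is precomputed, each query is a single normalized matrix-vector product, which also answers the redundancy concern from the question.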