Given is a list of text files. Each text file describes one topic. The input is a psychological concept that I describe in a few sentences.
The text files contain diacritics.
The algorithm should output the file that matches the described concept, together with a probability.
My pseudocode:
split the concept on spaces into an array of words, omitting stopwords
for each text file:
    split the file contents on spaces into an array of words (the "vector"), omitting stopwords
    i = 0
    for each word in the vector:
        if the word occurs in the concept:
            i++
    percentage = i / vectorcount * 100
    store filename -> percentage in a dictionary
sort the dictionary by percentage, descending
output the result
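A minimal Python sketch of this pseudocode, assuming UTF-8 plain-text files in a single folder and a placeholder STOPWORDS set (both are assumptions, not given in the question):

import os

STOPWORDS = {"and", "or", "the", "a"}  # placeholder; substitute a stopword list for your language

def tokenize(text):
    # Lowercase, split on whitespace, drop stopwords; diacritics are left untouched.
    return [w for w in text.lower().split() if w not in STOPWORDS]

def rank_files(concept, folder):
    concept_words = set(tokenize(concept))
    scores = {}
    for filename in os.listdir(folder):
        with open(os.path.join(folder, filename), encoding="utf-8") as f:
            words = tokenize(f.read())
        if not words:
            continue
        hits = sum(1 for w in words if w in concept_words)
        scores[filename] = hits / len(words) * 100  # share of file words that appear in the concept
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# usage: rank_files("a few sentences describing the concept", "texts/")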
Drawbacks I see with this approach:
Answer 0 (score: 0)
From https://colab.research.google.com/drive/1wXmqj3LAL6juxvQY_IHTtZAMuN46YZdV:
import itertools
import torch
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
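# Cosine similarity between two 1-D embedding vectors (reshaped to 2-D for sklearn).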
def cos(a, b):
return cosine_similarity(torch.tensor(a).view(1, -1), torch.tensor(b).view(1, -1))[0][0]
# Print options: truncate long arrays with an ellipsis and keep them human-readable.
np.set_printoptions(precision=4, threshold=10)
# The URL that hosts the DAN model for Universal Sentence Encoder
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"
embed = hub.Module(module_url)
bulbasaur = """A strange seed was planted on its back at birth. The plant sprouts and grows with this POKéMON."""
ivysaur = """When the bulb on its back grows large, it appears to lose the ability to stand on its hind legs."""
venusaur = """The plant blooms when it is absorbing solar energy. It stays on the move to seek sunlight."""
charmander = """Obviously prefers hot places. When it rains, steam is said to spout from the tip of its tail."""
charmeleon = """When it swings its burning tail, it elevates the temperature to unbearably high levels."""
charizard = """Spits fire that is hot enough to melt boulders. Known to cause forest fires unintentionally."""
input_texts = [bulbasaur, ivysaur, venusaur,
charmander, charmeleon, charizard]
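# TF1-style execution: initialize variables and lookup tables, then run the graph
# to obtain one fixed-length embedding per input text.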
with tf.Session() as session:
session.run([tf.global_variables_initializer(), tf.tables_initializer()])
sentence_embeddings = session.run(embed(input_texts))
names = ['bulbasaur', 'ivysaur', 'venusaur',
         'charmander', 'charmeleon', 'charizard']
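# Compare every pair of descriptions (including each with itself) by cosine similarity.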
for (mon1, vec1), (mon2, vec2) in itertools.product(zip(names, sentence_embeddings), repeat=2):
print('\t'.join(map(str, [mon1, mon2, cos(vec1, vec2)])))
[Output]:
bulbasaur bulbasaur 1.0000002
bulbasaur ivysaur 0.5978951
bulbasaur venusaur 0.57630616
bulbasaur charmander 0.27358365
bulbasaur charmeleon 0.36671823
bulbasaur charizard 0.3608557
ivysaur bulbasaur 0.5978951
ivysaur ivysaur 1.0
ivysaur venusaur 0.5274135
ivysaur charmander 0.34133852
ivysaur charmeleon 0.54503417
ivysaur charizard 0.26368174
venusaur bulbasaur 0.57630616
venusaur ivysaur 0.5274135
venusaur venusaur 0.99999994
venusaur charmander 0.37098676
venusaur charmeleon 0.50332355
venusaur charizard 0.50058115
charmander bulbasaur 0.27358365
charmander ivysaur 0.34133852
charmander venusaur 0.37098676
charmander charmander 1.0000001
charmander charmeleon 0.58522964
charmander charizard 0.4640133
charmeleon bulbasaur 0.36671823
charmeleon ivysaur 0.54503417
charmeleon venusaur 0.50332355
charmeleon charmander 0.58522964
charmeleon charmeleon 1.0000001
charmeleon charizard 0.59804976
charizard bulbasaur 0.3608557
charizard ivysaur 0.26368174
charizard venusaur 0.50058115
charizard charmander 0.4640133
charizard charmeleon 0.59804976
charizard charizard 1.0000001
For more details, see https://tfhub.dev/google/universal-sentence-encoder/2
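Note that the snippet above uses the TF1 Session API and hub.Module. With TensorFlow 2 and a newer module version, a sketch of the same idea (assuming TF2 and the /4 module are available) would look like:

import tensorflow_hub as hub

# hub.load returns the encoder as a callable TF2 SavedModel; no Session is needed.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sentence_embeddings = embed(input_texts).numpy()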
Answer 1 (score: 0)
In general, I would use some variant of word embeddings -> doc2vec, apply it to your text files, and store those vectors. For the psychological-concept input I would do the same, then search for the most similar vector. I am somewhat partial to spaCy here: https://spacy.io/api/doc and https://spacy.io/usage/vectors-similarity should point you in the right direction. PS: https://stackoverflow.com/help/how-to-ask
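A rough sketch of that idea with spaCy's built-in vectors (assuming the en_core_web_md model is installed; the model name and the files dictionary are illustrative, not from the question):

import spacy

nlp = spacy.load("en_core_web_md")  # a model that ships with word vectors; pick one matching your language

concept_doc = nlp("a few sentences describing the psychological concept")
file_docs = {name: nlp(text) for name, text in files.items()}  # files: {filename: file text}, assumed

# Rank files by similarity of their averaged word vectors to the concept description.
for name, doc in sorted(file_docs.items(), key=lambda kv: concept_doc.similarity(kv[1]), reverse=True):
    print(name, concept_doc.similarity(doc))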