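I have a column of medical services and, in the next column, similar services. In the code below, the name1 column holds the services and the name2 column holds the similar services. You can treat name2 as the target variable. I am trying to extract word synonyms.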
A <- data.frame(
  name1 = c(
    "X-ray right leg arteries",
    "consultation of gynecologist",
    "Magnetic reasoning imaging leg arteries",
    "radiography leg with 20km distance"
  ),
  name2 = c(
    "Radiography left leg arteries",
    "inspection of gynecalogist",
    "MRI right leg arteries",
    "x-ray right leg arteries"
  ),
  stringsAsFactors = FALSE
)
I concatenate the two columns so that I can apply the word2vec algorithm and look for synonyms.
A["name"] = paste(A$name1, A$name2)
A["name"] = gsub("[[:punct:]]", "", A$name)
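The problem with this approach: after concatenation, similar words (such as x-ray and radiography) end up far apart, and the word2vec algorithm penalizes them for that distance.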
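To make that distance concrete, here is a minimal base-R sketch that counts how many token positions separate "xray" and "radiography" in each concatenated row. The helper name `token_gap` is hypothetical and not part of the original code, and the whitespace split only approximates what `word_tokenizer()` does; the data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the example data so this sketch is self-contained.
A <- data.frame(
  name1 = c("X-ray right leg arteries",
            "consultation of gynecologist",
            "Magnetic reasoning imaging leg arteries",
            "radiography leg with 20km distance"),
  name2 = c("Radiography left leg arteries",
            "inspection of gynecalogist",
            "MRI right leg arteries",
            "x-ray right leg arteries"),
  stringsAsFactors = FALSE
)

# Same concatenation and punctuation stripping as in the question.
name <- gsub("[[:punct:]]", "", paste(A$name1, A$name2))

# Simple whitespace tokenization (an approximation of word_tokenizer()).
tokens <- strsplit(tolower(name), "\\s+")

# token_gap (hypothetical helper): distance in token positions between the
# first occurrences of two words, or NA if either word is missing.
token_gap <- function(tok, w1, w2) {
  i <- match(w1, tok)
  j <- match(w2, tok)
  if (is.na(i) || is.na(j)) NA_integer_ else abs(i - j)
}

gaps <- vapply(tokens, token_gap, integer(1), w1 = "xray", w2 = "radiography")
print(gaps)
```

In the two rows where both words occur, they sit four and five positions apart. As far as I know, `create_tcm()` weights each co-occurrence by 1 / distance by default, so such a pair contributes far less than adjacent words like "xray" and "right".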
library(magrittr)
library(text2vec)
library(stringr)
library(stringi)
tokens = A$name %>% tolower %>% word_tokenizer()
it = itoken(tokens)
# Create and Prune vocabulary
v = create_vocabulary(it) %>% prune_vocabulary(term_count_min=1)
vectorizer = vocab_vectorizer(v)
# Term co-occurrence matrix
tcm = create_tcm(it, vectorizer, skip_grams_window = 10)
# Glove Model
model = GlobalVectors$new(word_vectors_size=50, vocabulary=v, x_max=1, learning_rate=0.15)
wv_main = model$fit_transform(tcm,n_iter=25)
wv = model$components # Context vectors, dimension wvec_size x words (hence the t() below)
word_vectors = wv_main + t(wv)
# Make distance matrix
d = dist2(word_vectors, method="cosine") # Smaller values mean closer
# Find closely related words
findCloseWords = function(w, d, n) {
  words = rownames(d)
  i = which(words == w)
  if (length(i) > 0) {
    # Sort distances from w to every word; the first entry is w itself
    res = sort(d[i, ])
    print(as.matrix(res[2:(n + 1)]))
  } else {
    print("Word not in corpus.")
  }
}
findCloseWords("xray",d,3)
[,1]
right 0.7636013
mri 0.8105633
leg 0.8390371
It should return radiography for x-ray, since x-ray and radiography mean the same thing.