Extending Word2vec for supervision

Time: 2018-09-30 15:36:42

Tags: r word2vec text2vec

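I have some medical services in one column and similar services in the next column. In the code below, the name1 column holds the services and the name2 column holds the similar services. You can think of name2 as the target variable. I am trying to extract word synonyms.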

A <- data.frame(name1 = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "Magnetic reasoning imaging leg arteries",
  "radiography leg with 20km distance"
), name2 = c(
  "Radiography left leg arteries",
  "inspection of gynecalogist",
  "MRI right leg arteries",
  "x-ray right leg arteries"
), stringsAsFactors = FALSE)

I concatenate the two columns so that I can apply the word2vec algorithm and look for synonyms.

A["name"] = paste(A$name1, A$name2)
A["name"] = gsub("[[:punct:]]", "",  A$name)

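The problem with this approach: after concatenation, similar words (such as X-ray and radiography) end up far apart, and the Word2Vec algorithm penalizes them for that distance.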
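
To make the distance issue concrete, here is a minimal base R sketch that rebuilds the concatenated strings and counts how many token positions separate the two words in each row. `token_gap` is a hypothetical helper, not part of text2vec, and `strsplit` on whitespace is used as a stand-in for `word_tokenizer()`. As far as I understand it, `create_tcm()` weights a co-occurrence by 1/distance by default, which would explain why a neighbour such as "right" scores higher than "radiography".

```r
# Rebuild the concatenated text from the question (base R only).
name1 <- c("X-ray right leg arteries",
           "consultation of gynecologist",
           "Magnetic reasoning imaging leg arteries",
           "radiography leg with 20km distance")
name2 <- c("Radiography left leg arteries",
           "inspection of gynecalogist",
           "MRI right leg arteries",
           "x-ray right leg arteries")
name <- gsub("[[:punct:]]", "", paste(name1, name2))

# Assumption: whitespace splitting stands in for text2vec::word_tokenizer().
tokens <- strsplit(tolower(name), "\\s+")

# Hypothetical helper: token positions between the first occurrences of a and b.
token_gap <- function(tok, a, b) {
  abs(match(a, tok) - match(b, tok))
}

gap_row1 <- token_gap(tokens[[1]], "xray", "radiography")
gap_row4 <- token_gap(tokens[[4]], "xray", "radiography")
gap_near <- token_gap(tokens[[1]], "xray", "right")
print(c(row1 = gap_row1, row4 = gap_row4, xray_right = gap_near))
```

In both rows where the pair appears, the two words sit four or five tokens apart, while "xray" and "right" are adjacent.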

library(magrittr)
library(text2vec)
library(stringr)
library(stringi)

tokens = A$name %>% tolower %>% word_tokenizer()
it = itoken(tokens)

# Create and Prune vocabulary
v = create_vocabulary(it) %>% prune_vocabulary(term_count_min=1)
vectorizer = vocab_vectorizer(v)

# Term co-occurrence matrix
tcm = create_tcm(it, vectorizer, skip_grams_window = 10)

# Glove Model
model = GlobalVectors$new(word_vectors_size=50, vocabulary=v, x_max=1, learning_rate=0.15)
wv_main = model$fit_transform(tcm,n_iter=25)
wv = model$components  # Context vectors, dimension wvec_size x words (hence t(wv) below)
word_vectors = wv_main + t(wv)

# Make distance matrix
d = dist2(word_vectors, method="cosine")  # Smaller values mean closer

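# Find closely related words: print the n nearest neighbours of w
# according to the distance matrix d
findCloseWords = function(w, d, n) {
  words = rownames(d)
  i = which(words == w)
  if (length(i) > 0) {
    res = sort(d[i, ])
    # res[1] is normally w itself (distance 0), so skip it
    print(as.matrix(res[2:(n + 1)]))
  } else {
    print("Word not in corpus.")
  }
}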

findCloseWords("xray",d,3)

               [,1]
    right 0.7636013
    mri   0.8105633
    leg   0.8390371
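
For reading the numbers above: cosine distance is 1 minus cosine similarity, so 0 means identical direction and 1 means orthogonal vectors. The following base R sketch illustrates that scale; `cosine_distance` is a hypothetical helper written for this illustration, not a text2vec function.

```r
# Hypothetical helper: cosine distance between two numeric vectors.
cosine_distance <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

d_same <- cosine_distance(c(1, 2), c(2, 4))  # same direction
d_orth <- cosine_distance(c(1, 0), c(0, 1))  # orthogonal
print(c(same = d_same, orthogonal = d_orth))
```

On this scale, values around 0.76 to 0.84 indicate that even the closest neighbours of "xray" are only weakly similar.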

It should return radiography for xray, since X-ray and radiography mean the same thing.
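
One way to express "name2 is the target" without relying on token distance would be to count co-occurrences only across the two columns, so that every word in name1 is paired with every word in the matching name2 entry at equal weight. The base R sketch below only builds such a matrix; `clean` and `pair_tcm` are names made up for this illustration, and whether GloVe trains well on a matrix like this is an untested assumption.

```r
name1 <- c("X-ray right leg arteries",
           "consultation of gynecologist",
           "Magnetic reasoning imaging leg arteries",
           "radiography leg with 20km distance")
name2 <- c("Radiography left leg arteries",
           "inspection of gynecalogist",
           "MRI right leg arteries",
           "x-ray right leg arteries")

# Hypothetical helper: strip punctuation, lowercase, split on whitespace.
clean <- function(x) strsplit(tolower(gsub("[[:punct:]]", "", x)), "\\s+")
t1 <- clean(name1)
t2 <- clean(name2)

vocab <- sort(unique(c(unlist(t1), unlist(t2))))
pair_tcm <- matrix(0, length(vocab), length(vocab),
                   dimnames = list(vocab, vocab))

# Count each (name1 word, name2 word) pair once per row, symmetrically.
for (i in seq_along(t1)) {
  for (a in unique(t1[[i]])) {
    for (b in unique(t2[[i]])) {
      if (a != b) {
        pair_tcm[a, b] <- pair_tcm[a, b] + 1
        pair_tcm[b, a] <- pair_tcm[b, a] + 1
      }
    }
  }
}

print(pair_tcm["xray", c("radiography", "right", "mri")])
```

In this toy data "xray" is paired with "radiography" in two rows and never with "right" or "mri", which is the ordering the question is hoping for.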

0 answers:

No answers yet