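GloVe word mover's similarity

Time: 2018-09-08 17:10:36

Tags: r nlp text2vec

I want to compute text similarity using Relaxed Word Mover's Distance (RWMD). I have two different datasets (corpora), shown below.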

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance",
  "x-ray left hand"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = F)

I am using the text2vec package in R.

library(text2vec)
library(stringr)
prep_fun = function(x) {
  x %>% 
    # make text lower case
    str_to_lower %>% 
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>% 
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
# Combine both datasets
C = rbind(A, B)

C$name = prep_fun(C$name)

it = itoken(C$name, progressbar = FALSE)
v = create_vocabulary(it) %>% prune_vocabulary()
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 3)
glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
wv = glove_model$fit_transform(tcm, n_iter = 10)

# get average of main and context vectors as proposed in GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RWMD$new(wv)
# Parentheses are required: nrow(A)+1:nrow(C) parses as nrow(A) + (1:nrow(C))
rwmd_dist = dist2(dtm[1:nrow(A), ], dtm[(nrow(A) + 1):nrow(C), ], method = rwmd_model, norm = 'none')

head(rwmd_dist)

          [,1]      [,2]      [,3]      [,4]
[1,] 0.1220713 0.7905035 0.3085216 0.4182328
[2,] 0.7043127 0.1883473 0.8031200 0.7038919
[3,] 0.1220713 0.7905035 0.3856520 0.4836772
[4,] 0.5340587 0.6259011 0.7146630 0.2513135
[5,] 0.3403019 0.5575993 0.7568583 0.5124514

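Does skip_grams_window = 3 in tcm = create_tcm(it, vectorizer, skip_grams_window = 3) mean that the 3 words to the right are checked when the co-occurrence matrix is built? For example, would the text "X-ray right leg arteries" produce the following vector for the target "X-ray"?

right  leg  arteries
1      1    1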

What is word_vectors_size used for? I have read about the GloVe algorithm, but I do not understand what this parameter does.

glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)

1 Answer:

Answer 0: (score: 0)

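It is recommended to specify skip_grams_window together with the skip_grams_window_context argument (valid values: "symmetric", "right", "left"). The window is therefore not necessarily limited to the words on the right; which side is checked depends on that argument. [Documentation]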
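To make the three context settings concrete, here is a minimal base R sketch. It does not call text2vec: window_counts is a hypothetical helper written only for illustration, and it uses raw counts, ignoring any distance weighting that create_tcm may apply. Note also that prep_fun in the question replaces the hyphen with a space, so "x-ray" is tokenized as "x" and "ray"; the sketch therefore uses "ray" as the target.

```r
# Hypothetical helper (not part of text2vec): count the words that fall
# inside a window of `window` tokens around every occurrence of `target`.
window_counts <- function(tokens, target, window = 3,
                          context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  # Relative positions to inspect for the chosen context
  offsets <- switch(context,
    right     = seq_len(window),
    left      = -seq_len(window),
    symmetric = c(-seq_len(window), seq_len(window))
  )
  hits <- which(tokens == target)
  pos <- unlist(lapply(hits, function(i) i + offsets))
  # Drop positions that fall outside the document
  pos <- pos[pos >= 1 & pos <= length(tokens)]
  table(tokens[pos])
}

# Tokens as they look after prep_fun and whitespace tokenization
tokens <- c("x", "ray", "right", "leg", "arteries")

right_only <- window_counts(tokens, "ray", window = 3, context = "right")
both_sides <- window_counts(tokens, "ray", window = 3, context = "symmetric")

print(right_only)
print(both_sides)
```

With context = "right" the target "ray" co-occurs once each with right, leg and arteries; with "symmetric" the preceding token x is counted as well.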

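The word_vectors_size argument defines the dimensionality of the underlying word vectors. Each word is mapped to a vector in an N-dimensional vector space. Several articles explain word vectors well (article 1, article 2).

In your example, glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3) produces 10-dimensional word vectors.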
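As a rough picture of what that means for the result, the sketch below builds a stand-in embedding matrix in base R. The values are random, not trained GloVe vectors, and the names vocab, wv_demo and cosine are made up for this illustration. The point is only the shape: one row per vocabulary word and word_vectors_size columns.

```r
set.seed(42)

# Illustrative vocabulary; a real one would come from create_vocabulary()
vocab <- c("x", "ray", "leg", "arteries", "hand")
word_vectors_size <- 10

# Random stand-in for trained embeddings: one row per word,
# one column per embedding dimension
wv_demo <- matrix(rnorm(length(vocab) * word_vectors_size),
                  nrow = length(vocab),
                  dimnames = list(vocab, NULL))

# Cosine similarity between two word vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

print(dim(wv_demo))
print(cosine(wv_demo["leg", ], wv_demo["hand", ]))
```

A larger word_vectors_size adds columns to this matrix, which gives the model more room to separate words but also means more parameters to train.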

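Choosing an appropriate dimensionality for the word vectors is important. According to a reply by {{3}} from October 2014:

"The typical interval is between 100 and 300. I would say you need at least 50D to reach even the lowest acceptable accuracy. If you pick fewer dimensions, you will start to lose the properties of high-dimensional spaces. If training time is not a big deal for your application, I would stick with 200D, as it gives nice features. Extreme accuracy can be obtained with 300D. Beyond 300D, word features will not improve dramatically, and training will be extremely slow."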