I want to compute text similarity using Relaxed Word Mover's Distance (RWMD). I have two different datasets (corpora); see below.
A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance",
  "x-ray left hand"
), stringsAsFactors = FALSE)
B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = FALSE)
I am using the text2vec package in R. It looks like I am doing something wrong.
library(text2vec)
library(stringr)
prep_fun = function(x) {
  x %>%
    # make text lower case
    str_to_lower %>%
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
C = rbind(A, B)
C$name = prep_fun(C$name)
it = itoken(C$name, progressbar = FALSE)
v = create_vocabulary(it) %>% prune_vocabulary()
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 3)
glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
wv = glove_model$fit_transform(tcm, n_iter = 10)
# combine main and context vectors (their sum, following the suggestion in the GloVe paper)
wv = wv + t(glove_model$components)
rwmd_model = RWMD$new(wv)
# parentheses matter: nrow(A)+1:nrow(C) would be parsed as nrow(A) + (1:nrow(C))
rwmd_dist = dist2(dtm[1:nrow(A), ], dtm[(nrow(A)+1):nrow(C), ], method = rwmd_model, norm = 'none')
head(rwmd_dist)
          [,1]      [,2]      [,3]      [,4]
[1,] 0.1220713 0.7905035 0.3085216 0.4182328
[2,] 0.7043127 0.1883473 0.8031200 0.7038919
[3,] 0.1220713 0.7905035 0.3856520 0.4836772
[4,] 0.5340587 0.6259011 0.7146630 0.2513135
[5,] 0.3403019 0.5575993 0.7568583 0.5124514
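A note on row ranges like the one passed to dist2 above: in R the `:` operator binds more tightly than `+`, so `nrow(A)+1:nrow(C)` and `(nrow(A)+1):nrow(C)` select different rows. The sketch below illustrates this with plain numbers; `n_a` and `n_c` are stand-in names for `nrow(A)` and `nrow(C)`, not part of the original code.

```r
n_a <- 5  # stand-in for nrow(A)
n_c <- 9  # stand-in for nrow(C), i.e. rows of rbind(A, B)

# Without parentheses this is n_a + (1:n_c), which runs past the last row.
wrong <- n_a + 1:n_c

# With parentheses it is the intended range of B's rows inside C.
right <- (n_a + 1):n_c

print(wrong)
print(right)
```

With these values, `wrong` runs from 6 to 14 while `right` covers only rows 6 to 9.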
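Desired output: "consultation of gynecologist" in data frame A should be mapped to "consultation (inspection) of gynecalogist" in data frame B. Similarly, every text in data frame A should be matched to its closest text in data frame B.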
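Given a distance matrix like the one printed above, one way to sketch this mapping is to take, for each row (a text in A), the column (a text in B) with the smallest distance. This is only an illustration built from the printed values; `a_names`, `b_names`, `best`, and `matches` are names introduced here, and no threshold for rejecting poor matches is applied.

```r
# Distance matrix copied from the printed output (rows = A, columns = B).
rwmd_dist <- matrix(c(
  0.1220713, 0.7905035, 0.3085216, 0.4182328,
  0.7043127, 0.1883473, 0.8031200, 0.7038919,
  0.1220713, 0.7905035, 0.3856520, 0.4836772,
  0.5340587, 0.6259011, 0.7146630, 0.2513135,
  0.3403019, 0.5575993, 0.7568583, 0.5124514
), nrow = 5, byrow = TRUE)

a_names <- c("X-ray right leg arteries", "consultation of gynecologist",
             "x-ray leg arteries", "x-ray leg with 20km distance",
             "x-ray left hand")
b_names <- c("X-ray left leg arteries",
             "consultation (inspection) of gynecalogist",
             "MRI right leg arteries",
             "X-ray right leg arteries with special care")

# For each row of A, the index of the closest row of B.
best <- apply(rwmd_dist, 1, which.min)

matches <- data.frame(
  A = a_names,
  B = b_names[best],
  distance = rwmd_dist[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)
print(matches)
```

A cutoff on `distance` could be added afterwards if some texts in A should stay unmatched.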
Answer 0 (score: 0)
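I am working on something similar, if not the same, and I will upload my attempt soon. Right now I am trying to tune the vectors, window, and graphs to find out whether a corpus of 5,700 speeches, averaging between 1,000 and 2,000 words each (after stop-word removal and stemming), is enough.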
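If it is still needed I will come back and post the link. As far as I can tell, though, you did not tokenize the corpus: your itoken call is different from how I understand it should be used. Also, the examples I have seen online use the word_tokenizer function.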
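A minimal sketch of passing a tokenizer explicitly to itoken, assuming text2vec is installed. The two sample strings are taken from the question after its preprocessing step; `texts` is a name introduced here for illustration.

```r
library(text2vec)

# Two already-cleaned strings from the question's data.
texts <- c("x ray right leg arteries", "consultation of gynecologist")

# Name the tokenizer explicitly so it is clear how the text is split.
it <- itoken(texts, tokenizer = word_tokenizer, progressbar = FALSE)

# The vocabulary shows which tokens were actually produced.
v <- create_vocabulary(it)
print(v$term)
```

Inspecting `v$term` is a quick way to check that the corpus was split into words as expected.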
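Finally, try the pdist2 function, putting the texts you want to compare on corresponding rows of the two data frames. It computes "parallel" distances, that is, the distance between row i of one input and row i of the other.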
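A rough sketch of what a pdist2 call can look like, assuming text2vec is installed. To keep it small it uses plain cosine distance on a bag-of-words matrix rather than RWMD, and the pairs in `a` and `b` (names introduced here) are lined up by hand from the question's data.

```r
library(text2vec)

# Row i of `a` is compared only with row i of `b`.
a <- c("x ray right leg arteries", "consultation of gynecologist")
b <- c("x ray left leg arteries", "consultation inspection of gynecalogist")

it <- itoken(c(a, b), tokenizer = word_tokenizer, progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it))
dtm <- create_dtm(it, vectorizer)

# One distance per pair of rows, not a full A-by-B matrix.
d <- pdist2(dtm[1:2, ], dtm[3:4, ], method = "cosine", norm = "l2")
print(d)
```

Because pdist2 only compares matching rows, it fits the case where candidate pairs are already known; for finding the best match of each text in A among all texts in B, the full matrix from dist2 is still what is needed.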