Higher weight for prefixes

Time: 2018-03-31 09:28:33

Tags: r levenshtein-distance cosine-similarity quanteda

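When computing similarity, is there a method or distance measure that can give a higher weight to prefixes? I know about the Jaro-Winkler method, but it only applies at the character level. I am looking for similarity at the word level.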

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "Rgraphy left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "Rgraphy right shoulder",
  "X-ray left shoulder",
  "Rgraphy right leg arteries"
), stringsAsFactors = FALSE)

library(quanteda)
corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")

docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

dtm3 <- rbind(dfm(corp1, ngrams=1:2), dfm(corp2, ngrams=2))
d2 <- textstat_simil(dtm3, method = "cosine", diag = TRUE)
as.matrix(d2)[docnames(corp1), docnames(corp2)]

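I want "X-ray right leg arteries" from data frame A to map to "X-ray left leg arteries" from data frame B rather than to "Rgraphy right leg arteries". In other words, the similarity score between "X-ray right leg arteries" and "X-ray left leg arteries" should be higher than the score between "X-ray right leg arteries" and "Rgraphy right leg arteries".

Likewise, I want "Rgraphy left shoulder" to map to "Rgraphy right shoulder" rather than to "X-ray left shoulder". The above is only an example. In practice I have a substantial list that is not limited to "X-ray" and "Rgraphy", so I do not want to filter on "X-ray" and "Rgraphy" first and then compute similarity. The solution should be more algorithmic.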

1 Answer:

Answer 0 (score: 2)

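It sounds like you want to preserve certain diagnostic procedures as features regardless of the exact wording used, so that these can form the basis for computing similarity between documents.

You can do this by defining the phrases in a dictionary and applying it before constructing the dfm. Here, I have expanded your texts slightly to include other features.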

library(quanteda)

A <- data.frame(text = c("Patient had X-ray right leg arteries.",
                         "Subject was administered Rgraphy left shoulder",
                         "Exam consisted of x-ray leg arteries",
                         "Patient administered x-ray leg with 20km distance."),
                row.names = paste0("A", 1:4), stringsAsFactors = FALSE)
# Note: B reuses the "A" row-name prefix, which is why its documents show up
# as A11, A21, A31 and A41 in the output further down.
B <- data.frame(text = c("Patient had X-ray left leg arteries",
                         "Rgraphy right shoulder given to patient",
                         "X-ray left shoulder revealed nothing sinister",
                         "Rgraphy right leg arteries tested"),
                row.names = paste0("A", 1:4), stringsAsFactors = FALSE)

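Now we can define a dictionary whose entries match the phrases that you want to treat as equivalent when computing similarity. In this example, it does not matter whether the X-ray is of the right leg or the left leg, or whether the side is unspecified. Similarly, we do not care whether the "Rgraphy" procedure is specific to the left or the right shoulder. (Obviously, you will need to adjust and refine these entries depending on what is in your texts and what you are willing to treat as equivalent.)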

medicaldict <- dictionary(list(
    xray_leg = c("X-ray right leg arteries", "x-ray left leg arteries", 
                 "x-ray leg arteries"),
    rgraphy_leg = c("Rgraphy right leg arteries", "Rgraphy left leg arteries"),
    xray_shoulder = c("X-ray left shoulder", "X-ray right shoulder"),
    rgraphy_shoulder = c("Rgraphy left shoulder", "Rgraphy right shoulder")
))
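
For a longer list of procedures, writing every variant by hand becomes tedious. The following is a minimal base-R sketch, not part of the original answer, of generating the left, right and side-unspecified variants programmatically. `laterality_variants()`, `modalities` and `sites` are hypothetical names, and the sketch assumes the side is always written as the word "left" or "right" between the modality and the body site. Unlike the hand-written dictionary, it adds a side-unspecified variant under every key.

```r
# Hypothetical helper: build the "left", "right" and side-unspecified
# variants of one procedure phrase.
laterality_variants <- function(modality, site) {
  phrases <- paste(modality, c("left", "right", ""), site)
  # Collapse the double space left behind by the empty side.
  gsub("\\s+", " ", phrases)
}

# Illustrative inputs (assumed); extend these vectors for a real list.
modalities <- c(xray = "X-ray", rgraphy = "Rgraphy")
sites <- c(leg = "leg arteries", shoulder = "shoulder")

# One list entry per modality/site combination, keyed like the dictionary.
variants <- list()
for (m in names(modalities)) {
  for (s in names(sites)) {
    key <- paste(m, s, sep = "_")
    variants[[key]] <- laterality_variants(modalities[[m]], sites[[s]])
  }
}

str(variants)
```

The resulting named list has the same shape as the one passed to dictionary() above, so it could be supplied in its place.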

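When we apply this to the tokens using tokens_lookup() in a "non-exclusive" way, the matched sequences are replaced by the dictionary keys. Note that because tokens_lookup() collapses each matching token sequence into a single phrase token, there is no longer any need to form token ngrams as in your question.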

toks <- tokens(corpus(A) + corpus(B)) %>%
    tokens_lookup(dictionary = medicaldict, exclusive = FALSE)
toks
# tokens from 8 documents.
# A1 :
# [1] "Patient"  "had"      "XRAY_LEG" "."       
# 
# A2 :
# [1] "Subject"          "was"              "administered"     "RGRAPHY_SHOULDER"
# 
# A3 :
# [1] "Exam"      "consisted" "of"        "XRAY_LEG" 
# 
# A4 :
# [1] "Patient"      "administered" "x-ray"        "leg"          "with"         "20km"         "distance"     "."           
# 
# A11 :
# [1] "Patient"  "had"      "XRAY_LEG"
# 
# A21 :
# [1] "RGRAPHY_SHOULDER" "given"            "to"               "patient"         
# 
# A31 :
# [1] "XRAY_SHOULDER" "revealed"      "nothing"       "sinister"     
# 
# A41 :
# [1] "RGRAPHY_LEG" "tested"     

Finally, we can compute document similarity based on the collapsed features rather than on the original bag of words.

dfm(toks) %>%
    textstat_simil(method = "cosine", diag = TRUE)
#            A1        A2        A3        A4       A11       A21       A31
# A2  0.0000000                                                            
# A3  0.2500000 0.0000000                                                  
# A4  0.3535534 0.1767767 0.0000000                                        
# A11 0.8660254 0.0000000 0.2886751 0.2041241                              
# A21 0.2500000 0.2500000 0.0000000 0.1767767 0.2886751                    
# A31 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000          
# A41 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
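
The dictionary approach depends on listing equivalent phrases in advance. As a complement that is closer to the "higher weight for prefixes" wording of the question, here is a minimal base-R sketch of a word-level cosine similarity in which earlier words count more. It is not part of the original answer and not a quanteda feature. `prefix_weights()`, `prefix_cosine()` and the geometric `decay = 0.5` are assumptions chosen for illustration, and tokenization is a plain split on whitespace.

```r
# Hypothetical helper: weight each word by its position so that the first
# word gets weight 1, the second `decay`, the third `decay^2`, and so on.
prefix_weights <- function(text, decay = 0.5) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  w <- decay^(seq_along(words) - 1)
  # Sum the weights when a word occurs more than once.
  vapply(split(w, words), sum, numeric(1))
}

# Cosine similarity between the two position-weighted word vectors.
prefix_cosine <- function(a, b, decay = 0.5) {
  wa <- prefix_weights(a, decay)
  wb <- prefix_weights(b, decay)
  shared <- intersect(names(wa), names(wb))
  sum(wa[shared] * wb[shared]) / sqrt(sum(wa^2) * sum(wb^2))
}

# The names from the question.
A_names <- c("X-ray right leg arteries", "Rgraphy left shoulder",
             "x-ray leg arteries", "x-ray leg with 20km distance")
B_names <- c("X-ray left leg arteries", "Rgraphy right shoulder",
             "X-ray left shoulder", "Rgraphy right leg arteries")

# Rows are entries of A, columns are entries of B.
sim <- sapply(B_names, function(b) {
  sapply(A_names, function(a) prefix_cosine(a, b))
})

# For each entry of A, pick the entry of B with the highest score.
best <- colnames(sim)[apply(sim, 1, which.max)]
data.frame(A = A_names, best_match_in_B = best)
```

With these settings the first two names in A map to "X-ray left leg arteries" and "Rgraphy right shoulder", which is the behaviour the question asks for. A smaller `decay` makes the leading word dominate more strongly, while `decay = 1` reduces to an ordinary bag-of-words cosine.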