Question

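I have two large sets of phrases. For each phrase in one list, I need to check what percentage of its words are present in the phrases of the other list, and get the best match from that other list.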

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = FALSE)

fuzzy_prep_words <- function(words) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", words)), "\\W+"))
  return(words)
}

fuzzy_prep_words(A$name)
fuzzy_prep_words(B$name)

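I can extract the words from the lists, but I can't work out how to calculate the number and percentage of matching words in the other list.

"X-ray right leg arteries" has an exact match in B, so it should return two columns: match = "X-ray right leg arteries" and distance = 100%. For the second phrase, "x-ray left shoulder", it should return match = "X-ray left leg arteries" and distance = 66.67%, because 2 of the 3 words in "x-ray left shoulder" match. For the third phrase, it should return either "X-ray left leg arteries" or "X-ray right leg arteries".

I have already explored string distance algorithms such as Levenshtein (LV), cosine, and LCS, but I don't want to use them because my real dataset contains a large number of phrases.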

Answer 1

How about something like this?

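# For each phrase in A, compare its words against the words of every phrase in B
m <- lapply(strsplit(tolower(gsub("[[:punct:]]", "", A$name)), " "), function(w1)
    do.call(rbind.data.frame, lapply(strsplit(tolower(gsub("[[:punct:]]", "", B$name)), " "), function(w2) {
        cbind.data.frame(
            matches_string_from_B = paste(w2, collapse = " "),
            # Share of the words in the A phrase that also occur in the B phrase
            percentage = sum(w1 %in% w2) / length(w1) * 100)
        }
    ))
)
# Label each result with the normalised phrase from A
names(m) <- tolower(gsub("[[:punct:]]", "", A$name))

m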
$`xray right leg arteries`
    matches_string_from_B percentage
1  xray left leg arteries         75
2                xray leg         50
3          xray right leg         75
4 xray right leg arteries        100

$`xray left shoulder`
    matches_string_from_B percentage
1  xray left leg arteries   66.66667
2                xray leg   33.33333
3          xray right leg   33.33333
4 xray right leg arteries   33.33333

$`xray leg arteries`
    matches_string_from_B percentage
1  xray left leg arteries  100.00000
2                xray leg   66.66667
3          xray right leg   66.66667
4 xray right leg arteries  100.00000

$`xray leg with 20km distance`
    matches_string_from_B percentage
1  xray left leg arteries         40
2                xray leg         40
3          xray right leg         40
4 xray right leg arteries         40

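Explanation: split each entry of A$name into words, calculate the percentage of those words that also appear in each split entry of B$name, and store the results in a list of data frames. Using tolower and gsub("[[:punct:]]", "", ...) makes the matching case-insensitive and ignores punctuation characters.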
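
To make the arithmetic concrete, here is a minimal sketch of the calculation for a single pair of phrases, using the second phrase from A and the first phrase from B. The helper name prep is hypothetical and only wraps the same tolower/gsub/strsplit steps used above.

```r
# Hypothetical helper: normalise one phrase into a vector of lowercase words
prep <- function(s) strsplit(tolower(gsub("[[:punct:]]", "", s)), " ")[[1]]

w1 <- prep("x-ray left shoulder")      # "xray" "left" "shoulder"
w2 <- prep("X-ray left leg arteries")  # "xray" "left" "leg" "arteries"

# Which words of the A phrase occur in the B phrase?
w1 %in% w2                             # TRUE TRUE FALSE

# 2 of the 3 words match, so the score is 2 / 3 * 100
pct <- sum(w1 %in% w2) / length(w1) * 100
```

Note that the denominator is the number of words in the A phrase, so the score is not symmetric: swapping w1 and w2 here would give 2 / 4 * 100 instead.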

Update

To get the best match (by percentage), you can do this:

do.call(rbind.data.frame, lapply(m, function(x) x[which.max(x$percentage), ]))
#                              matches_string_from_B percentage
#xray right leg arteries     xray right leg arteries  100.00000
#xray left shoulder           xray left leg arteries   66.66667
#xray leg arteries            xray left leg arteries  100.00000
#xray leg with 20km distance  xray left leg arteries   40.00000
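
One caveat: which.max returns only the first maximum, so ties are dropped (for "x-ray leg arteries", both "X-ray left leg arteries" and "X-ray right leg arteries" score 100%). The following is a sketch of one possible variant, not part of the original answer, that keeps every tied best match. It also splits B$name once up front rather than once per phrase in A, which may help on a large dataset. The names prep, wordsB, and best are illustrative.

```r
A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = FALSE)

# Hypothetical helper: lowercase, strip punctuation, split each phrase on spaces
prep <- function(x) strsplit(tolower(gsub("[[:punct:]]", "", x)), " ")

# Split the phrases in B only once
wordsB <- prep(B$name)

best <- lapply(prep(A$name), function(w1) {
  # Percentage of the words in the A phrase found in each B phrase
  pct <- vapply(wordsB, function(w2) sum(w1 %in% w2) / length(w1) * 100, numeric(1))
  # Keep every B phrase that reaches the maximum, not just the first one
  keep <- pct == max(pct)
  data.frame(matches_string_from_B = B$name[keep],
             percentage = pct[keep],
             stringsAsFactors = FALSE)
})
names(best) <- A$name

best[["x-ray leg arteries"]]
```

With the sample data, the entry for "x-ray leg arteries" contains two rows, and the entry for "x-ray leg with 20km distance" contains all four phrases from B, since each of them scores 40%.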
