Question

I have two data sets.

a <- c("adidas shoes","hot tea","pizza","hill station")
b <- c("shoes","plastic cup","pizza","I love to go to hill","travelling in motor van",
       "buy adidas shoes","run using adidas shoes")

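I want to match each word in every sentence of the first vector against all elements of the second vector, and then pick the element that matches the largest number of words.

To do this, I used the following code: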

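# a and b are plain character vectors, so index with [1] rather than [1,]
a_split <- unlist(strsplit(a[1], " "))
b_split <- unlist(strsplit(b[1], " "))
# share of the words in a[1] that also occur in b[1], as a percentage
match_perc <- length(intersect(a_split, b_split))/length(a_split)*100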

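So essentially what I am trying to do here is match "adidas" and "shoes" (the first element of vector "a") against all elements of vector "b", keep the best match percentage, and repeat this for every element of "a". We always keep the highest percentage, so each sentence in "a" ends up with exactly one matching sentence and one match percentage. If several elements of "b" tie for the highest percentage, we take the first match.

The expected output is as follows: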

a <- c("adidas shoes","hot tea","pizza","hill station")
Matching_String <- c("buy adidas shoes","NA","pizza","I love to go to hill")
match_perc <- c(100,0,100,50)
final_op <- data.frame(a,Matching_String,match_perc)

Answer 1

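You can also use the purrr::map family of functions (here map2_dbl, applied to a and the Matching_String vector from the question):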

library(purrr)
match_perc <- map2_dbl(a, Matching_String, function(a, b) {
  a_split <- unlist(strsplit(a, " "))
  b_split <- unlist(strsplit(b, " "))
  length(intersect(a_split, b_split))/length(a_split)*100
})
final_op <- data.frame(a,Matching_String,match_perc)
final_op
             a      Matching_String match_perc
1 adidas shoes     buy adidas shoes        100
2      hot tea                   NA          0
3        pizza                pizza        100
4 hill station I love to go to hill         50

Also take a look at the stringr::str_extract_all function for extracting strings.
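
Note that the map2_dbl call above takes Matching_String as already known. A minimal base-R sketch of how the best match could be found first, assuming ties go to the first candidate and a score of 0 means "no match" (best_match is a hypothetical helper name introduced here, not part of purrr or of the question):

```r
a <- c("adidas shoes", "hot tea", "pizza", "hill station")
b <- c("shoes", "plastic cup", "pizza", "I love to go to hill",
       "travelling in motor van", "buy adidas shoes", "run using adidas shoes")

# Hypothetical helper: score one sentence of `a` against every element of `b`
best_match <- function(x, candidates) {
  x_split <- unlist(strsplit(x, " "))
  scores <- vapply(strsplit(candidates, " "),
                   function(y) length(intersect(x_split, y)) / length(x_split) * 100,
                   numeric(1))
  i <- which.max(scores)  # which.max returns the first maximum, so ties go to the earliest candidate
  data.frame(a = x,
             # assumption: a score of 0 is reported as a missing match
             Matching_String = if (scores[i] > 0) candidates[i] else NA_character_,
             match_perc = scores[i],
             stringsAsFactors = FALSE)
}

final_op <- do.call(rbind, lapply(a, best_match, candidates = b))
final_op
```

Here the unmatched row gets a real NA rather than the string "NA" used in the question's expected output.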

Answer 2

It is useful to keep the output of strsplit as a list.

as <- strsplit(a, " ")
bs <- strsplit(b, " ")

You can build a match matrix from these lists by vectorizing the matching function and passing it to outer.

matchFun <- function(x, y) length(intersect(x, y)) / length(x) * 100
mx <- outer(as, bs, Vectorize(matchFun))
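
For intuition about what that matrix holds, the same values can be built with nested sapply calls. This is only an illustrative sketch (mx_explicit is a name introduced here), with one row per element of a and one column per element of b:

```r
a <- c("adidas shoes", "hot tea", "pizza", "hill station")
b <- c("shoes", "plastic cup", "pizza", "I love to go to hill",
       "travelling in motor van", "buy adidas shoes", "run using adidas shoes")

as <- strsplit(a, " ")
bs <- strsplit(b, " ")
matchFun <- function(x, y) length(intersect(x, y)) / length(x) * 100

# Inner sapply scores one element of `a` against all of `b`;
# the outer sapply stacks those results column-wise, so transpose to get rows = a
mx_explicit <- t(sapply(as, function(x) sapply(bs, function(y) matchFun(x, y))))
dimnames(mx_explicit) <- list(a, as.character(seq_along(b)))
mx_explicit
```

Each cell is the percentage of words in the row's sentence that also appear in the column's sentence, which is why a later row-wise which.max picks the best candidate.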

Then collect the maxima into vectors.

m <- apply(mx, 1, which.max)  # the maximum column of each row

z <- unlist(apply(mx, 1, function(x) x[which.max(x)]))  # maximum percentage
z[z == 0] <- NA  # this gives you the NA if you want it

Finally, put the results into a data frame.

data.frame(a, Matching_String=b[m], match_perc=z)

#              a      Matching_String match_perc
# 1 adidas shoes     buy adidas shoes        100
# 2      hot tea                shoes         NA
# 3        pizza                pizza        100
# 4 hill station I love to go to hill         50
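
Note that row 2 of this output still shows "shoes", because which.max returns 1 when every score in a row is 0. If the goal is the question's expected output, one possible adjustment is to blank the matched string and keep the percentage at 0. This is a sketch under the assumption that a score of 0 should mean "no match"; no_match and res are names introduced here:

```r
a <- c("adidas shoes", "hot tea", "pizza", "hill station")
b <- c("shoes", "plastic cup", "pizza", "I love to go to hill",
       "travelling in motor van", "buy adidas shoes", "run using adidas shoes")

as <- strsplit(a, " ")
bs <- strsplit(b, " ")
matchFun <- function(x, y) length(intersect(x, y)) / length(x) * 100
mx <- outer(as, bs, Vectorize(matchFun))

m <- apply(mx, 1, which.max)  # column of the first maximum in each row
z <- apply(mx, 1, max)        # maximum percentage in each row
no_match <- z == 0            # rows of `a` that share no word with any element of `b`

res <- data.frame(a,
                  Matching_String = ifelse(no_match, NA, b[m]),
                  match_perc = z,
                  stringsAsFactors = FALSE)
res
```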

Data

a <- c("adidas shoes","hot tea","pizza","hill station")
b <- c("shoes","plastic cup","pizza","I love to go to hill","travelling in motor van",
       "buy adidas shoes","run using adidas shoes")

Matching strings without using a loop