I have a tbl_df and want to compute the percentage of matching words between two strings.
The data looks like this:
# A tibble: 3 x 2
  X               Y
  <chr>           <chr>
1 "mary smith"    "mary smith"
2 "mary smith"    "john smith"
3 "mike williams" "jack johnson"
Desired output (share of words matched, in any order):
# A tibble: 3 x 3
  X               Y                  Z
  <chr>           <chr>          <dbl>
1 "mary smith"    "mary smith"    1.0
2 "mary smith"    "john smith"    0.50
3 "mike williams" "jack johnson"  0.0
Answer 0 (score: 3)
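A base R option is to split both columns on spaces (strsplit), find the words common to each pair (intersect), and divide the length of that intersection by the length of the first split:
df1$Z <- mapply(function(x, y) length(intersect(x, y))/length(x),
                strsplit(df1$X, " "), strsplit(df1$Y, " "))
df1$Z
#[1] 1.0 0.5 0.0
Or, in the tidyverse, we can use map2_dbl and apply the same logic:
library(tidyverse)
df1 %>%
  # map2_dbl() returns a numeric vector; plain map2() would give a list column
  mutate(Z = map2_dbl(strsplit(X, " "), strsplit(Y, " "), ~
           length(intersect(.x, .y))/length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0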
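The logic above can also be wrapped in a small reusable helper. This is only a sketch in base R: the function name `word_overlap` is hypothetical (it does not come from the answer), and it assumes that words are separated by single spaces and that the denominator is the number of words in the first string, as in the code above.

```r
# Hypothetical helper (the name is an assumption, not part of the answer):
# share of the words in x that also occur in y, regardless of order.
word_overlap <- function(x, y) {
  xs <- strsplit(x, " ", fixed = TRUE)
  ys <- strsplit(y, " ", fixed = TRUE)
  mapply(function(a, b) length(intersect(a, b)) / length(a),
         xs, ys, USE.NAMES = FALSE)
}

# Sample data taken from the question
df1 <- data.frame(
  X = c("mary smith", "mary smith", "mike williams"),
  Y = c("mary smith", "john smith", "jack johnson"),
  stringsAsFactors = FALSE
)

df1$Z <- word_overlap(df1$X, df1$Y)
df1$Z
#[1] 1.0 0.5 0.0
```

Because intersect() drops duplicates while length(a) does not, a string that repeats a word (for example "mary mary") would score below 1 even against itself; whether that is acceptable depends on the data.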
Answer 1 (score: 2)
Here is a tidyverse option using stringr::str_split:
library(dplyr)
library(purrr)    # provides map2_dbl()
library(stringr)
df %>%
  mutate(Z = map2_dbl(str_split(X, " "), str_split(Y, " "), ~ sum(.x == .y) / length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0
Or using stringi::stri_extract_all_words:
library(stringi)
df %>%
  mutate(Z = map2_dbl(stri_extract_all_words(X), stri_extract_all_words(Y), ~ sum(.x == .y) / length(.x)))
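One thing to keep in mind: `sum(.x == .y)` compares words position by position, whereas the question asks for matches in any order. The following base R sketch shows the difference on a made-up pair of strings that is not part of the original data; `%in%` is swapped in here as one possible order-independent alternative.

```r
# Illustrative pair (not from the question): same words, reversed order
x <- strsplit("mary smith", " ", fixed = TRUE)[[1]]
y <- strsplit("smith mary", " ", fixed = TRUE)[[1]]

positional <- sum(x == y) / length(x)     # counts a match only at the same index
any_order  <- sum(x %in% y) / length(x)   # counts a match anywhere in y

positional
#[1] 0
any_order
#[1] 1
```

With `==`, R also recycles the shorter vector (with a warning if the lengths are not multiples of each other) when the two strings have different word counts, so the positional version is probably safest when both columns always contain the same number of words.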
Answer 2 (score: 0)
Try using stringsim() from the stringdist package:
library(dplyr)       # for tibble(), mutate() and %>%
library(stringdist)
tbl <- tibble(x = c("mary smith", "mary smith", "mike williams"),
              y = c("mary smith", "john smith", "jack johnson"))
# lv = Levenshtein distance
tbl %>% mutate(z = stringsim(x, y, method = 'lv'))
# jw = Jaro-Winkler
tbl %>% mutate(z = stringsim(x, y, method = 'jw'))
## > tbl %>% mutate(z = stringsim(x, y, method ='lv'))
## # A tibble: 3 x 3
##   x             y                 z
##   <chr>         <chr>         <dbl>
## 1 mary smith    mary smith   1.00
## 2 mary smith    john smith   0.600
## 3 mike williams jack johnson 0.0769
## > tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## # A tibble: 3 x 3
##   x             y                 z
##   <chr>         <chr>         <dbl>
## 1 mary smith    mary smith   1.00
## 2 mary smith    john smith   0.733
## 3 mike williams jack johnson 0.494
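To see where the 'lv' figures come from: as far as I understand the stringdist documentation, stringsim() rescales the distance to the 0 to 1 range, which for Levenshtein amounts to 1 - d / max(nchar(a), nchar(b)). Below is a rough check using base R's utils::adist(), swapped in so the snippet needs no extra package. The helper name `lv_sim` is hypothetical.

```r
# Hypothetical helper: Levenshtein similarity computed with base R's adist().
# Assumes the rescaling 1 - d / max(nchar(a), nchar(b)).
lv_sim <- function(a, b) {
  d <- mapply(function(s, t) as.numeric(utils::adist(s, t)),
              a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

lv_sim(c("mary smith", "mary smith"), c("mary smith", "john smith"))
#[1] 1.0 0.6
```

These are character-level similarities, so they will generally not reproduce the word-level percentages asked for in the question: row 2 scores 0.6 rather than 0.5, and row 3 is above zero even though the two strings share no words.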