Add a column with the percentage of matching words between two different columns (row-wise) in R

Time: 2018-08-09 13:15:51

Tags: r string dplyr tidyverse

I have a tbl_df and want to see the percentage of matching words between two strings.

The data looks like this:

# A tibble 3 x 2
       X                 Y
     <chr>             <chr>
1 "mary smith"      "mary smith"
2 "mary smith"      "john smith"
3 "mike williams"   "jack johnson"

Desired output (% of matching words, in any order):

# A tibble 3 x 3 
       X               Y           Z 
     <chr>           <chr>        <dbl>
1 "mary smith"    "mary smith"     1.0 
2 "mary smith"    "john smith"     0.50 
3 "mike williams" "jack johnson"   0.0

3 answers:

Answer 0: (score: 3)

A base R option is to split both columns on spaces with strsplit, find the common words with intersect, and divide the length of the intersection by the number of words in X. Here df1 holds the question's data with character columns X and Y.

df1$Z <- mapply(function(x, y) length(intersect(x, y))/length(x),
                strsplit(df1$X, " "), strsplit(df1$Y, " "))
df1$Z
#[1] 1.0 0.5 0.0

Or in the tidyverse, we can use map2 and apply the same logic:

library(tidyverse)
df1 %>% 
  mutate(Z = map2(strsplit(X, " "), strsplit(Y, " "), ~ 
                       length(intersect(.x, .y))/length(.x)))
#              X            Y   Z
#1    mary smith   mary smith   1
#2    mary smith   john smith 0.5
#3 mike williams jack johnson   0
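
For a self-contained run, df1 can be rebuilt from the question's table. The sketch below is an assumption-laden variant rather than the answer's exact code: it assumes plain character columns, and it swaps map2 for purrr's map2_dbl so that Z comes back as a numeric column (as in the desired output) instead of a list column.

```r
library(dplyr)
library(purrr)

# Assumed reconstruction of the question's data (character columns X and Y)
df1 <- tibble(
  X = c("mary smith", "mary smith", "mike williams"),
  Y = c("mary smith", "john smith", "jack johnson")
)

# map2_dbl() returns a double vector, so Z is <dbl> rather than <list>
res <- df1 %>%
  mutate(Z = map2_dbl(strsplit(X, " "), strsplit(Y, " "),
                      ~ length(intersect(.x, .y)) / length(.x)))

res$Z
#[1] 1.0 0.5 0.0
```

The denominator is the number of words in X, so the result is not symmetric when X and Y have different word counts.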

Answer 1: (score: 2)

Here is a tidyverse option using stringr::str_split. Here df holds the question's data with character columns X and Y.

library(dplyr)
library(purrr)    # for map2()
library(stringr)
df %>%
    mutate(Z = map2(str_split(X, " "), str_split(Y, " "), ~sum(.x == .y) / length(.x)))
#              X            Y   Z
#1    mary smith   mary smith   1
#2    mary smith   john smith 0.5
#3 mike williams jack johnson   0

Or using stringi::stri_extract_all_words:

library(stringi)
df %>%
    mutate(Z = map2(stri_extract_all_words(X), stri_extract_all_words(Y), ~sum(.x == .y) / length(.x)))
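
One thing worth checking against the "in any order" requirement: sum(.x == .y) compares words position by position, while intersect ignores order. A small base R sketch of the difference, using a hypothetical reordered pair that does not appear in the question's data:

```r
# Hypothetical pair: same words, different order
x <- strsplit("mary smith", " ")[[1]]
y <- strsplit("smith mary", " ")[[1]]

# Position-wise comparison: "mary" vs "smith", then "smith" vs "mary"
positional <- sum(x == y) / length(x)

# Set-based comparison: word order is ignored
set_based <- length(intersect(x, y)) / length(x)

positional
#[1] 0
set_based
#[1] 1
```

On the three rows in the question both approaches give the same result, because the shared words happen to sit in the same positions.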

Answer 2: (score: 0)

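Try stringsim() from the stringdist package. Note that it measures character-level similarity rather than the share of matching words, so the values differ from the desired output.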

library(dplyr)       # for tibble(), mutate() and %>%
library(stringdist)

tbl <- tibble(x = c("mary smith", "mary smith", "mike williams"),
              y = c("mary smith", "john smith", "jack johnson"))

# lv = levenshtein distance
tbl %>% mutate(z = stringsim(x, y, method ='lv'))

# jw =  jaro-winkler 
tbl %>% mutate(z = stringsim(x, y, method ='jw'))

## > tbl %>% mutate(z = stringsim(x, y, method ='lv'))
## # A tibble: 3 x 3
##  x             y                 z
##  <chr>         <chr>         <dbl>
## 1 mary smith    mary smith   1.00  
## 2 mary smith    john smith   0.600 
## 3 mike williams jack johnson 0.0769

## > tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## # A tibble: 3 x 3
##   x             y                z
##  <chr>         <chr>        <dbl>
## 1 mary smith    mary smith   1.00 
## 2 mary smith    john smith   0.733
## 3 mike williams jack johnson 0.494
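
For intuition about the lv numbers: as far as I understand the stringdist documentation, the Levenshtein similarity is one minus the edit distance divided by the length of the longer string. The sketch below reproduces the first two values with base R's adist (from utils) as a stand-in; it is not stringdist's own implementation.

```r
x <- c("mary smith", "mary smith")
y <- c("mary smith", "john smith")

# adist() returns a matrix of all pairwise edit distances;
# the diagonal pairs x[i] with y[i]
d <- diag(adist(x, y))

# Similarity: 1 - distance / length of the longer string
sim <- 1 - d / pmax(nchar(x), nchar(y))
sim
#[1] 1.0 0.6
```

Turning "mary" into "john" takes four substitutions out of ten characters, hence 0.6 rather than the word-level 0.5.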