Loop through two data frames in R and compare corresponding column values

Time: 2020-06-24 23:09:16

Tags: r string dplyr stringr stringdist

I have two data frames containing text data about users:

x <- data.frame("Address_line1" = c("123 Street","21 Hill drive"), 
                "City" = c("Chicago","London"), "Phone" = c("123","219"))
y <- data.frame("Address_line1" = c("461 road","PO Box 123","543 Highway"), 
                "City" = c("Dallas","Paris","New York" ), "Phone" = c("235","542","842"))
> x
  Address_line1     City Phone
1    123 Street  Chicago   123
2 21 Hill drive   London   219


> y
  Address_line1     City Phone
1      461 road   Dallas   235
2    PO Box 123    Paris   542
3   543 Highway New York   842

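For each row of the data frame x, I want to loop over all rows of y, compare the corresponding columns (address to address, city to city, and so on), and get the string distance for each pair.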

So for the first row of x, I want this output:

[16 20 20]

where 16 is

stringdist("123 Street","461 road", method = "lv")+
stringdist("Chicago","Dallas", method = "lv")+
stringdist("123","235", method = "lv") 

Likewise, the sum against the second row of y is 20, and the sum against the third row of y is 20.

Similarly, for every row of x I want a list containing nrow(y) elements.

1 Answer:

Answer 0: (score: 1)

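We can use a for loop:

library(stringdist)

out <- c()
for (i in seq_len(nrow(x))) {
  for (j in seq_len(nrow(y))) {
    # One-row data frames for the current pair of rows
    x1 <- x[i, ]; y1 <- y[j, ]
    # Column-wise Levenshtein distances, summed across the columns
    out <- c(out, sum(unlist(Map(stringdist, x1, y1,
                                 MoreArgs = list(method = 'lv')))))
  }
}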

out
#[1] 16 20 20 19 20 21
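
The loop returns one flat vector, while the question asks for nrow(y) values per row of x. A minimal sketch of one way to get that shape follows. It assumes base R's `adist()` (generalized Levenshtein distance with unit costs by default) as a stand-in for `stringdist(..., method = 'lv')`; `dist_mat` and `dist_list` are illustrative names, not part of the original answer.

```r
# Sample data from the question (strings kept as character, not factor)
x <- data.frame(Address_line1 = c("123 Street", "21 Hill drive"),
                City = c("Chicago", "London"),
                Phone = c("123", "219"),
                stringsAsFactors = FALSE)
y <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)

# adist(a, b) gives a length(a) x length(b) matrix of edit distances.
# Map() pairs up the columns of x and y; Reduce() adds the per-column
# matrices, so dist_mat[i, j] is the total distance between row i of x
# and row j of y.
dist_mat <- Reduce(`+`, Map(adist, x, y))

# One numeric vector of length nrow(y) per row of x
dist_list <- lapply(seq_len(nrow(dist_mat)),
                    function(i) unname(dist_mat[i, ]))
```

Here `dist_list[[1]]` holds the totals for the first row of x against every row of y, in the original row order of y. Working on whole columns also avoids growing a vector inside a double loop.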

The expected output format is not entirely clear. We can also use a tidyverse approach:

library(dplyr)
library(tidyr)
library(purrr)
library(stringdist)
library(stringr)
crossing(x, y, .name_repair = 'unique') %>%
   rename_all(~ str_remove(., "\\.{2,}")) %>% 
   split.default(str_remove(names(.), "\\d+$")) %>%
   map(~ pmap(.x,  ~ stringdist(..1, ..2, method = 'lv'))) %>% 
   transpose %>% 
   map_dbl(~ flatten_dbl(.x) %>% 
            sum)
#[1] 16 20 20 19 21 20
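
The last two values appear in a different order from the loop result (`21 20` rather than `20 21`). This is consistent with the documented behaviour of `crossing()`, which de-duplicates and sorts its inputs, so the rows of y would be compared in sorted order (461 road, 543 Highway, PO Box 123) rather than in their original order. The sketch below reproduces that effect in base R. It assumes `adist()` as a stand-in for `stringdist(..., method = 'lv')`, and `y_sorted` and `dist_sorted` are illustrative names.

```r
# Sample data from the question (strings kept as character, not factor)
x <- data.frame(Address_line1 = c("123 Street", "21 Hill drive"),
                City = c("Chicago", "London"),
                Phone = c("123", "219"),
                stringsAsFactors = FALSE)
y <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)

# Sort y by its first column; method = "radix" makes the ordering
# independent of the session locale. x is already in sorted order.
y_sorted <- y[order(y$Address_line1, method = "radix"), ]

# Summed per-column Levenshtein distances, rows of x by rows of y_sorted
dist_sorted <- Reduce(`+`, Map(adist, x, y_sorted))

# Flatten row by row so the result can be compared with the output above
as.vector(t(dist_sorted))
```

If the original row order of y matters, it may be safer to carry explicit row indices through the pipeline than to rely on the order in which the combinations are returned.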