I have two data frames containing text data about users:
x <- data.frame("Address_line1" = c("123 Street", "21 Hill drive"),
                "City" = c("Chicago", "London"),
                "Phone" = c("123", "219"))
y <- data.frame("Address_line1" = c("461 road", "PO Box 123", "543 Highway"),
                "City" = c("Dallas", "Paris", "New York"),
                "Phone" = c("235", "542", "842"))
> x
  Address_line1    City Phone
1    123 Street Chicago   123
2 21 Hill drive  London   219
> y
  Address_line1     City Phone
1      461 road   Dallas   235
2    PO Box 123    Paris   542
3   543 Highway New York   842
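For each row of data frame x, I want to loop over all rows of y, compare the corresponding columns (address to address, city to city, and so on), and get the string distance for each pair.
So for the first row of x, I want the output:
[16 20 20]
where 16 is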
stringdist("123 Street","461 road", method = "lv")+
stringdist("Chicago","Dallas", method = "lv")+
stringdist("123","235", method = "lv")
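The sum against the second row of y is 20, and the sum against the third row of y is also 20.
Similarly, for each row of x I want a list containing nrow(y) elements.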
Answer 0 (score: 1)
We can use a for loop:
library(stringdist)

out <- c()
for (i in seq_len(nrow(x))) {
  for (j in seq_len(nrow(y))) {
    x1 <- x[i, ]; y1 <- y[j, ]
    # sum the column-wise Levenshtein distances between row i of x and row j of y
    out <- c(out, sum(unlist(Map(stringdist, x1, y1,
                                 MoreArgs = list(method = 'lv')))))
  }
}
out
#[1] 16 20 20 19 20 21
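The loop returns one flat vector, while the question asks for nrow(y) values per row of x. A minimal sketch of how the flat result could be regrouped, assuming the values are in the order the loop produces them (all of y for the first row of x, then all of y for the second). The names n_x, n_y, dist_mat and dist_list are illustrative and not part of the original answer.

```r
# Flat result copied from the loop output above (rows of x vary slowest)
out <- c(16, 20, 20, 19, 20, 21)
n_x <- 2  # nrow(x) in the example data
n_y <- 3  # nrow(y) in the example data

# Matrix view: row i holds the distances from x[i, ] to every row of y
dist_mat <- matrix(out, nrow = n_x, ncol = n_y, byrow = TRUE)

# List view: one numeric vector of length nrow(y) per row of x
dist_list <- split(out, rep(seq_len(n_x), each = n_y))

dist_list[[1]]
#[1] 16 20 20
```

With the real data, n_x and n_y would be replaced by nrow(x) and nrow(y).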
The expected output format is not entirely clear. We can also use a tidyverse approach:
library(dplyr)
library(tidyr)
library(purrr)
library(stringdist)
library(stringr)
crossing(x, y, .name_repair = 'unique') %>%
  rename_all(~ str_remove(., "\\.{2,}")) %>%
  split.default(str_remove(names(.), "\\d+$")) %>%
  map(~ pmap(.x, ~ stringdist(..1, ..2, method = 'lv'))) %>%
  transpose %>%
  map_dbl(~ flatten_dbl(.x) %>%
            sum)
#[1] 16 20 20 19 21 20
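As an alternative to both versions above, here is a sketch that avoids the explicit double loop. It swaps in base R's adist() (from utils) for stringdist(method = 'lv'); this is a substitution, not what the answer uses. With its default unit costs, adist() computes the Levenshtein distance and returns a full length(a) by length(b) matrix, so one matrix per column can be summed element-wise. The name dist_mat is illustrative.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps plain strings on R < 4.0)
x <- data.frame(Address_line1 = c("123 Street", "21 Hill drive"),
                City = c("Chicago", "London"),
                Phone = c("123", "219"),
                stringsAsFactors = FALSE)
y <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)

# For each pair of matching columns, build an nrow(x)-by-nrow(y) distance matrix,
# then add the per-column matrices together element-wise
dist_mat <- Reduce(`+`, Map(function(a, b) adist(as.character(a), as.character(b)),
                            x, y))

# One numeric vector of length nrow(y) per row of x
dist_list <- lapply(seq_len(nrow(x)), function(i) dist_mat[i, ])
dist_list[[1]]
#[1] 16 20 20
```

Here the rows of dist_mat follow the original row order of x and its columns follow the original row order of y, matching the for loop. The last two values of the tidyverse output are swapped relative to the loop, most likely because crossing() sorts its inputs, so the rows of y are visited in a different order there.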