Compare the corresponding columns of a data frame with a list in R

Time: 2020-06-26 16:36:12

Tags: r dplyr tidyr stringdist

I have a data frame containing user data:

x <- data.frame("Address_line1" = c("461 road","PO Box 123","543 Highway"), 
                "City" = c("Dallas","Paris","New York" ), "Phone" = c("235","542","842"))
x
  Address_line1     City Phone
1      461 road   Dallas   235
2    PO Box 123    Paris   542
3   543 Highway New York   842

I also have a list (here a named character vector) with the same fields as the data frame, in the same order:

y = c("443 road","New york","842")
names(y) = colnames(x)

y

Address_line1          City         Phone 
   "443 road"    "New york"         "842"

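I want to go through each row of this data frame, compute stringdist() between each field of x and the corresponding field of y, and sum those values to get a total score for each row.

For example, the score for the first row would be: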

row_1 = stringdist("461 road","443 road",method="lv") + stringdist("Dallas","New york",method="lv") + stringdist("235","842",method="lv")

row_1
[1] 13

In the same way, I want a score for every row of the data frame. This is the code I wrote using a for loop:

list_dist = rep(NA,0)

for(i in seq_len(nrow(x))){
    list_x = x[i,]
    sum=0
    for(j in seq_len(length(y))){
        sum = sum + stringdist(y[j],list_x[[j]],method = "lv")
    }
    #print(sum)
    list_dist[i] = sum
}


list_dist
[1] 13 18  8

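I get the desired output, but the problem is the computation time. My actual table has about 100,000 rows and 10 columns, so the code takes nearly 30 minutes to run. Is there a more efficient way to do this?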

1 answer:

Answer 0 (score: 0)

This is faster.

rowSums(mapply(stringdist, y, x, method = 'lv'))
#[1] 13 18  8
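
To make the one-liner easier to follow, here is a minimal sketch that splits it into its two steps. It assumes the stringdist package is installed and rebuilds x and y from the question; the names per_field and scores are only illustrative. The point is that stringdist() is vectorised, so mapply() needs one call per column rather than one call per cell.

```r
library(stringdist)  # assumption: the stringdist package is installed

# Data from the question (stringsAsFactors = FALSE keeps the columns as character)
x <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)
y <- c(Address_line1 = "443 road", City = "New york", Phone = "842")

# Step 1: one stringdist() call per column; y[j] is compared with every
# element of column x[[j]], giving one column of distances per field.
per_field <- mapply(stringdist, y, x, method = "lv")
dim(per_field)  # one row per row of x, one column per field

# Step 2: summing across the fields gives the total score of each row.
scores <- rowSums(per_field)
scores
```

With the three-row x from the question, scores should match the 13 18 8 shown above.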

Edit

Here are timings with a fairly small x. The two functions are timed with the microbenchmark package.

Rahul <- function(){
  list_dist = rep(NA,0)

  for(i in seq_len(nrow(x))){
      list_x = x[i,]
      sum=0
      for(j in seq_len(length(y))){
          sum = sum + stringdist(y[j],list_x[[j]],method = "lv")
      }
      #print(sum)
      list_dist[i] = sum
  }
  list_dist
}
Rui <- function(){
  rowSums(mapply(stringdist, y, x, method = 'lv'))
}

library(microbenchmark)

for(i in 1:6) x <- rbind(x,x)
dim(x)
[1] 192  3

mb <- microbenchmark(
  Rui = Rui(),
  Rahul = Rahul()
)

print(mb, unit = 'relative', order = 'median')
#Unit: relative
#  expr      min       lq     mean   median       uq      max neval
#   Rui   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000   100
# Rahul 141.5944 137.4175 133.4313 134.4163 132.2977 119.6172   100

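The difference is already 2 orders of magnitude, and it grows as nrow(x) grows.

Edit 2

Following up on the question, the function below returns a matrix with nrow(y) rows and nrow(x) columns, where y may be a vector (giving a single row) or a data.frame.

This function is not the function Rui from the speed test above.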

rui <- function(x, y){
  out <- mapply(stringdistmatrix, y, x, MoreArgs = list(method = 'lv'), SIMPLIFY = FALSE)
  Reduce('+', out)
}

z <- data.frame(Address_line1 = c("443 road", "461 road"),
                City = c("New york", "London"), Phone = c("842", "524"))

rui(x, y)
#     [,1] [,2] [,3]
#[1,]   13   18    8

rui(x, z)
#     [,1] [,2] [,3]
#[1,]   13   18    8
#[2,]    9   17   19
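
As a cross-check that needs no extra package, the same matrix can be sketched with base R's adist(), which computes the generalized Levenshtein distance and, with its default unit costs, should agree with method = 'lv'. This swaps adist() in for stringdistmatrix() and is not part of the original answer; x and z are rebuilt from the examples above, and per_field and total are illustrative names.

```r
# Data from the examples above (original three-row x, two-row z)
x <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)
z <- data.frame(Address_line1 = c("443 road", "461 road"),
                City = c("New york", "London"),
                Phone = c("842", "524"),
                stringsAsFactors = FALSE)

# adist(a, b) returns a length(a) x length(b) matrix of edit distances,
# so each field yields an nrow(z) x nrow(x) matrix.
per_field <- Map(function(a, b) adist(a, b), z, x)

# Adding the per-field matrices gives the total score for every
# (row of z, row of x) pair.
total <- Reduce(`+`, per_field)
dimnames(total) <- NULL
total
```

Entry [i, j] of total is the score of row i of z against row j of x, so it should reproduce the rui(x, z) matrix shown above.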