Question

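Apologies for the long delay. Some things came up, and I have not had a chance until now to come back and update the post with clearer, easier-to-follow details, data and code.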

Here is some data.

data <- data.table(ZIP1  = c('99999', '99999', '99999', '99999', '99999'),
                   City1 = c('LOS ANGELAS', 'OAKLAND', 'SAN DIEGO', 'LOS ANGELOS', 'LOST ANGELOST'),
                   Name1 = c("JOHN", 'JOhn', 'JoN', 'JOHN', 'JOHNSON'))

data2 <- data.table(ZIP2  = c('99999', '12345', '99999', '99999', '99999'),
                    City2 = c('LOS ANGELAS', 'OAKLAND', 'SAN DIEGO', 'LOS ANGELOS', 'LOST ANGELOST'),
                    Name2 = c("JOHN", 'JOhn', 'JoN', 'JOHN', 'JOHNSON'))

zips <- data.table(zip = c('12345','45678','19899','99999','02345','98129','09101','10001','09839'))

The code as it is written today:

library('stringr')
library('stringdist')
library('readr')
library('data.table')

func1 <- function(df) {
  df <- copy(df)  # .SD is locked inside `[`, so work on a copy
  cols <- names(df)
  df[, (cols) := lapply(.SD, as.character), .SDcols = cols]
  # Jaro-Winkler similarity between the joined name and city columns
  df[, MatchName := 1 - stringdist(Name1, Name2, method = "jw", p = 0.1)]
  df[, MatchCity := 1 - stringdist(City1, City2, method = "jw", p = 0.1)]
  ##df1$glm <- predict(fit.glm, df1)  Overlay a model to predict if it's a match
  ##df1matches <- df1[glm == '1'] And then we write it somewhere else, SqlServer, disk, etc.
  #rm(df1) then we remove it as we loop through the next zip of matches
  df[]  # return the scored table
}

setkey(data, ZIP1)
setkey(data2, ZIP2)
setkey(zips, zip)

for (row in zips$zip) {
  #print(row)
  df1 <- data[ZIP1 %in% row]
  if (nrow(df1) == 0) {
    next  # no records for this zip, so skip the join
  }
  df2 <- df1[data2, nomatch=0, allow.cartesian=TRUE]
  df2[, func1(.SD)]
}

Which returns the following data frame:

    ZIP1         City1   Name1       City2 Name2 MatchName MatchCity
1: 99999   LOS ANGELAS    JOHN LOS ANGELAS  JOHN 0.7333333 0.5200216
2: 99999       OAKLAND    JOhn LOS ANGELAS  JOHN 0.7333333 0.5200216
3: 99999     SAN DIEGO     JoN LOS ANGELAS  JOHN 1.0000000 1.0000000
4: 99999   LOS ANGELOS    JOHN LOS ANGELAS  JOHN 1.0000000 1.0000000
5: 99999 LOST ANGELOST JOHNSON LOS ANGELAS  JOHN 1.0000000 1.0000000
6: 99999   LOS ANGELAS    JOHN   SAN DIEGO   JoN 0.7333333 0.5200216
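For reference, the two score columns rely on `stringdist()` comparing its two arguments element by element, so both inputs should be columns of the same joined table (same length, same row order). A minimal sketch of that behaviour follows; the vectors `name1` and `name2` are made up for illustration and are not part of the data above.

```r
library(stringdist)

# Two equal-length character vectors; stringdist() compares them pairwise
name1 <- c("JOHN", "JOhn", "JoN")
name2 <- c("JOHN", "JOHN", "JOHN")

# Jaro-Winkler distance lies in [0, 1], so 1 - distance is a similarity score
sim <- 1 - stringdist(name1, name2, method = "jw", p = 0.1)
print(sim)
```

The comparison is case-sensitive, so normalising case first (for example with `toupper()`) may be worth considering before scoring.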

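Essentially, I am trying to run the loop in parallel to speed it up. Depending on the size of `data`, this process can take up to 5 hours. Because of size and memory constraints, we use the loop to split the dataset into smaller, manageable chunks. To reiterate, the process we have today does work, and works well. The hope is to speed it up with foreach and a parallel backend. Ideally the returned result is a data frame/data table, since we insert every row into a database (it could also be written to disk and inserted afterwards).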

results = foreach(zips=iter(zips, by='row'), .combine=rbind) %dopar% {
  df1 <- data[data$ZIP1 %in% row]
  df2 <- df1[data2, nomatch=0, allow.cartesian=TRUE]
  if (nrow(df1) == 0) {
    next
  }
  df2[,func1(.SD)]
}

But my attempt throws an "undefined columns selected" error.
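A minimal sketch of how the per-zip loop could be expressed with foreach, under some assumptions: the backend is doParallel with two workers, and `score_zip` is a hypothetical helper introduced here, not part of the original code. Three things differ from the attempt above. The iteration variable is the one the body actually uses, `next` is replaced by returning `NULL` (a foreach body is an expression, not a loop), and data.table is loaded on the workers through `.packages`. The last point is a likely, though unconfirmed, cause of the error: without data.table attached, `data[...]` with a single index falls back to data.frame column selection.

```r
library(data.table)
library(stringdist)
library(foreach)
library(doParallel)

# Sample data from the question
data <- data.table(ZIP1  = rep("99999", 5),
                   City1 = c("LOS ANGELAS", "OAKLAND", "SAN DIEGO", "LOS ANGELOS", "LOST ANGELOST"),
                   Name1 = c("JOHN", "JOhn", "JoN", "JOHN", "JOHNSON"))
data2 <- data.table(ZIP2  = c("99999", "12345", "99999", "99999", "99999"),
                    City2 = c("LOS ANGELAS", "OAKLAND", "SAN DIEGO", "LOS ANGELOS", "LOST ANGELOST"),
                    Name2 = c("JOHN", "JOhn", "JoN", "JOHN", "JOHNSON"))
zips <- data.table(zip = c("12345", "45678", "19899", "99999", "02345",
                           "98129", "09101", "10001", "09839"))

# Hypothetical helper: join and score the records for a single zip code
score_zip <- function(z, data, data2) {
  df1 <- data[ZIP1 == z]
  if (nrow(df1) == 0) return(NULL)  # replaces `next`
  df2 <- df1[data2, on = c(ZIP1 = "ZIP2"), nomatch = 0, allow.cartesian = TRUE]
  if (nrow(df2) == 0) return(NULL)
  df2[, MatchName := 1 - stringdist(Name1, Name2, method = "jw", p = 0.1)]
  df2[, MatchCity := 1 - stringdist(City1, City2, method = "jw", p = 0.1)]
  df2[]
}

cl <- makeCluster(2)  # assumption: two local worker processes
registerDoParallel(cl)

# Each iteration returns a data.table or NULL; the default result is a list
res <- foreach(z = zips$zip,
               .packages = c("data.table", "stringdist")) %dopar% {
  score_zip(z, data, data2)
}

stopCluster(cl)

# rbindlist() skips the NULL entries left by zips with no records
results <- rbindlist(res)
print(results)
```

Collecting a list and binding once with `rbindlist()` avoids calling `rbind` repeatedly through `.combine`. If the combined result is too large for memory, each worker could instead write its chunk to disk (for example with `fwrite()`) and return nothing.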

Foreach loop - object not found

0 answers: