Question

I am switching from Excel to R and would like to know how to do the following in R. I have a dataset that looks like this:

df1<-data.frame(Zipcode=c("7941AH","7941AG","7941AH","7941AZ"),
                From=c(2,30,45,1),
                To=c(20,38,57,8),
                Type=c("even","mixed","odd","mixed"),
                GPS=c(12345,54321,11221,22331)) 

df2<-data.frame(zipcode=c("7914AH", "7914AH", "7914AH", "7914AG","7914AG","7914AZ"), 
                housenum=c(18, 19, 50, 32, 104,11))

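The first dataset contains zip codes, house number ranges (From and To), a Type column indicating whether the range covers even, odd, or mixed house numbers, and GPS coordinates. The second dataset contains only addresses (zip code and house number).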

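What I want to do is look up the GPS coordinates for the addresses in df2. For example, the address with zip code 7941AH and house number 18 (an even number between 2 and 20) has GPS coordinate 12345.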
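
To make the matching rule concrete, here is a minimal sketch of the check for one house number against one range. The helper name `rule_matches` is hypothetical and not part of the data or of any package; it only restates the From/To/Type description above.

```r
# Hypothetical helper: does a house number satisfy one From/To/Type rule?
rule_matches <- function(housenum, from, to, type) {
  # The number must lie inside the inclusive range
  in_range <- housenum >= from & housenum <= to
  # "mixed" accepts any parity; "odd" and "even" restrict it
  parity_ok <- type == "mixed" |
    (type == "odd"  & housenum %% 2 == 1) |
    (type == "even" & housenum %% 2 == 0)
  in_range & parity_ok
}

rule_matches(18, from = 2, to = 20, type = "even")  # TRUE
rule_matches(19, from = 2, to = 20, type = "even")  # FALSE, because 19 is odd
```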

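Update: It had not occurred to me that the size of the dataset matters for the choice of solution (I know, a bit naive...), so here is some extra information. The actual size of df1 is 472,000 observations, and df2 has 1.1 million observations. The number of unique zip codes in df1 is 280,000. I stumbled upon the post "speed up the loop operation in R", which has some interesting findings, but I don't know how to incorporate them into the solution provided by @josilber.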

Answer 1

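I would simply loop over each row of df2 and implement the logic needed to check whether the zip code matches, whether the house number falls within the range, and whether the even/odd type is satisfied: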

# Clean up data (character zip codes and fix the 7914 vs. 7941 issue in zip codes)
df2<-data.frame(zipcode=c("7941AH", "7941AH", "7941AH", "7941AG","7941AG","7941AZ"), 
                housenum=c(18, 19, 50, 32, 104,11))
df1$Zipcode <- as.character(df1$Zipcode)
df2$zipcode <- as.character(df2$zipcode)

# Loop to compute the GPS values
sapply(seq(nrow(df2)), function(x) {
  m <- df2[x,]
  matched <- df1$Zipcode == m$zipcode &
    m$housenum >= df1$From &
    m$housenum <= df1$To &
    (df1$Type == "mixed" |
     (df1$Type == "odd" & m$housenum %% 2 == 1) |
     (df1$Type == "even" & m$housenum %% 2 == 0))
  if (sum(matched) != 1) {
    return(NA)  # No matches or multiple matches
  } else {
    return(df1$GPS[matched])
  }
})
# [1] 12345    NA    NA 54321    NA    NA

By inspection, only the first and fourth elements of df2 match one of the rules in df1.
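
If the result should end up as a column of df2 rather than a printed vector, the same logic can be wrapped in a function. The following is a minimal sketch under a few assumptions: the name `lookup_gps` is hypothetical, the zip codes are already corrected to 7941, and `vapply` is swapped in for `sapply` so that the result is always a numeric vector.

```r
df1 <- data.frame(Zipcode = c("7941AH", "7941AG", "7941AH", "7941AZ"),
                  From = c(2, 30, 45, 1),
                  To = c(20, 38, 57, 8),
                  Type = c("even", "mixed", "odd", "mixed"),
                  GPS = c(12345, 54321, 11221, 22331),
                  stringsAsFactors = FALSE)
df2 <- data.frame(zipcode = c("7941AH", "7941AH", "7941AH",
                              "7941AG", "7941AG", "7941AZ"),
                  housenum = c(18, 19, 50, 32, 104, 11),
                  stringsAsFactors = FALSE)

# Hypothetical wrapper around the row-by-row check
lookup_gps <- function(rules, addresses) {
  vapply(seq_len(nrow(addresses)), function(i) {
    zip <- addresses$zipcode[i]
    num <- addresses$housenum[i]
    matched <- rules$Zipcode == zip &
      num >= rules$From & num <= rules$To &
      (rules$Type == "mixed" |
       (rules$Type == "odd"  & num %% 2 == 1) |
       (rules$Type == "even" & num %% 2 == 0))
    # Exactly one matching rule is required; otherwise return NA
    if (sum(matched) == 1) rules$GPS[matched] else NA_real_
  }, numeric(1))
}

df2$GPS <- lookup_gps(df1, df2)
df2$GPS
# [1] 12345    NA    NA 54321    NA    NA
```

Passing `stringsAsFactors = FALSE` keeps the zip codes as character vectors, so the `as.character` conversion is not needed in this variant.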

Answer 2

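Given the large data frames, your best bet is probably to merge df1 and df2 by zip code (that is, to get every pair of rows from the two data frames that share a zip code), filter by the house number criteria, remove duplicates (cases where multiple rules from df1 match the same address), and then store the information for all matched houses. Let's start with sample datasets of the sizes you indicated: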

set.seed(144)
df1 <- data.frame(Zipcode=sample(1:280000, 472000, replace=TRUE),
                  From=sample(1:50, 472000, replace=TRUE),
                  To=sample(51:100, 472000, replace=TRUE),
                  Type=sample(c("even", "odd", "mixed"), 472000, replace=TRUE),
                  GPS=sample(1:100, 472000, replace=TRUE))
df2 <- data.frame(zipcode=sample(1:280000, 1.1e6, replace=TRUE),
                  housenum=sample(1:100, 1.1e6, replace=TRUE))

Now we can compute the GPS data efficiently:

get.gps <- function(df1, df2) {
  # Add ID to df2
  df2$id <- 1:nrow(df2)
  m <- merge(df1, df2, by.x="Zipcode", by.y="zipcode")
  m <- m[m$housenum >= m$From &
         m$housenum <= m$To &
         (m$Type == "mixed" |
          (m$Type == "odd" & m$housenum %% 2 == 1) |
          (m$Type == "even" & m$housenum %% 2 == 0)),]
  m <- m[!duplicated(m$id) & !duplicated(m$id, fromLast=TRUE),]
  GPS <- rep(NA, nrow(df2))
  GPS[m$id] <- m$GPS
  return(GPS)
}
system.time(get.gps(df1, df2))
#    user  system elapsed 
#  16.197   0.561  17.583

This is a much more acceptable runtime: 18 seconds, instead of the 90 hours you estimated in the comments on my other answer!
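
To see what each step of the merge approach does, it can help to run it on the six toy addresses from the question. This is only a sketch, assuming the zip codes are corrected to 7941 as in my other answer; the intermediate names `merged`, `hits`, and `unique_hits` are illustrative and do not appear in the function above.

```r
df1 <- data.frame(Zipcode = c("7941AH", "7941AG", "7941AH", "7941AZ"),
                  From = c(2, 30, 45, 1),
                  To = c(20, 38, 57, 8),
                  Type = c("even", "mixed", "odd", "mixed"),
                  GPS = c(12345, 54321, 11221, 22331),
                  stringsAsFactors = FALSE)
df2 <- data.frame(zipcode = c("7941AH", "7941AH", "7941AH",
                              "7941AG", "7941AG", "7941AZ"),
                  housenum = c(18, 19, 50, 32, 104, 11),
                  stringsAsFactors = FALSE)
df2$id <- seq_len(nrow(df2))

# Step 1: every pairing of a rule and an address that share a zip code
merged <- merge(df1, df2, by.x = "Zipcode", by.y = "zipcode")
nrow(merged)
# [1] 9

# Step 2: keep only pairs where the house number satisfies the rule
hits <- merged[merged$housenum >= merged$From &
               merged$housenum <= merged$To &
               (merged$Type == "mixed" |
                (merged$Type == "odd"  & merged$housenum %% 2 == 1) |
                (merged$Type == "even" & merged$housenum %% 2 == 0)), ]

# Step 3: drop every address that was matched by more than one rule
dup <- duplicated(hits$id) | duplicated(hits$id, fromLast = TRUE)
unique_hits <- hits[!dup, ]

# Step 4: write the GPS values back in the original order of df2
gps <- rep(NA_real_, nrow(df2))
gps[unique_hits$id] <- unique_hits$GPS
gps
# [1] 12345    NA    NA 54321    NA    NA
```

The result agrees with the loop-based answer on the toy data, which is a cheap sanity check before running the function on the full data.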
