Question

I'm currently working on a data transformation. The data is not very large, about 190k rows.

I wrote a for loop like this:

for (i in 1:nrow(df2)){
#a
record.a <- df[which(df$first_lat==df2[i,"third_lat"] 
            & df$first_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"fourth_lat"] 
            & df$sixth_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,18] <- ifelse(nrow(record.a) != 0,record.a$order_cnt,NA)

#b
record.b <- df[which(df$fifth_lat==df2[i,"third_lat"] 
            & df$fifth_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"second_lat"] 
            & df$sixth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,19] <- ifelse(nrow(record.b) != 0,record.b$order_cnt,NA)

#c
record.c <- df[which(df$fifth_lat==df2[i,"first_lat"] 
            & df$fifth_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"second_lat"] 
            & df$fourth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,20] <- ifelse(nrow(record.c) != 0,record.c$order_cnt,NA)

#d
record.d <- df[which(df$third_lat==df2[i,"first_lat"] 
            & df$third_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"sixth_lat"] 
            & df$fourth_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,21] <- ifelse(nrow(record.d) != 0,record.d$order_cnt,NA)

#e
record.e <- df[which(df$third_lat==df2[i,"fifth_lat"] 
            & df$third_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"sixth_lat"] 
            & df$second_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,22] <- ifelse(nrow(record.e) != 0,record.e$order_cnt,NA)

#f
record.f <- df[which(df$first_lat==df2[i,"fifth_lat"] 
            & df$first_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"fourth_lat"] 
            & df$second_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,23] <- ifelse(nrow(record.f) != 0,record.f$order_cnt,NA)
}

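So basically, I need to fill six columns of df2 from df, each using its own set of criteria. The loop runs over nrow(df2), which is about 190k, and it is extremely slow. I checked the result with View(df2) and it is correct. Is there any way to make it faster? I may need to apply the same transformation to larger datasets in the future.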

df: (screenshot)

df2: (screenshot)

The data relates to a grid on a map. df2 is basically a subset of df with six extra columns added. Both df and df2 carry the same lon and lat information.

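Each grid_id represents a hexagonal area on the map. Each hexagon is connected to six other hexagons, and each connection is identified by two lon/lat pairs. What I want to do is find specific values from the six surrounding hexagons (in df) to fill the columns (a, b, c, d, e, f) in df2. In addition, two more conditions must match, namely hours and ten_mins_interval (df[,4]==df2[i,4] & df[,3]==df2[i,5]).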

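So I think the logic is:

For each grid_id, hours, ten_mins_interval in df2 (1 row)
find the 6 neighbouring grid_ids (6 rows) in df with the same hours and ten_mins_interval
fill order_cnt from those 6 rows into columns a, b, c, d, e, f of df2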
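
As a minimal illustration of that lookup logic for column a only, here is a sketch on made-up toy data. The helper name make_key and all values are hypothetical, and it assumes the time columns are named hours and ten_mins_interval in both data frames. It builds one composite key per row and uses match(), which, like the loop, takes the first matching row of df.

```r
# Toy data; every value here is invented for illustration.
df <- data.frame(
  grid_id = c("g1", "g2", "g3"),
  hours = c(8, 8, 8),
  ten_mins_interval = c(1, 1, 1),
  first_lat = c(1, 3, 5), first_lon = c(1, 3, 5),
  sixth_lat = c(2, 4, 6), sixth_lon = c(2, 4, 6),
  order_cnt = c(10, 20, 30)
)
df2 <- data.frame(
  grid_id = c("g1", "g2"),
  hours = c(8, 8),
  ten_mins_interval = c(1, 1),
  third_lat = c(3, 9), third_lon = c(3, 9),
  fourth_lat = c(4, 9), fourth_lon = c(4, 9)
)

# Hypothetical helper: paste the chosen columns into one string key per row.
# Note: keys are compared as text, so coordinates must print identically
# in both data frames for a match to be found.
make_key <- function(d, cols) {
  do.call(paste, c(unname(as.list(d[cols])), sep = "|"))
}

time_cols <- c("hours", "ten_mins_interval")
key_df  <- make_key(df,  c("first_lat", "first_lon", "sixth_lat", "sixth_lon", time_cols))
key_df2 <- make_key(df2, c("third_lat", "third_lon", "fourth_lat", "fourth_lon", time_cols))

# match() gives, for each row of df2, the first matching row of df (or NA).
df2$a <- df$order_cnt[match(key_df2, key_df)]
```

The same pattern would be repeated with the other five column pairings to fill b through f.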

Answer 1

If you start from the current df2[,1:17], you can use the merge command to add df2[,18]:

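# by.x names the key columns in df (its 4th and 3rd columns for the time keys);
# by.y names the key columns in df2 (its 4th and 5th columns),
# matching df[,4]==df2[i,4] & df[,3]==df2[i,5] in the loop.
df2 <- merge(df[,c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name","order_cnt")],
      df2,
      by.x=c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name"),
      by.y=c("third_lat","third_lon","fourth_lat","fourth_lon","col4name","col5name"),
      all.y=TRUE)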

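You need to replace col4name with the name of the fourth column, and so on; I can't tell from the screenshot what the names might be. Five more versions of this command can easily be produced to add the other five columns. Because the operation works on whole vectors, it is likely to be faster than the loop. This is untested, since the data was not provided in a suitable format.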
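
One way the six merge calls could be generated from a small table of column pairings is sketched below. This is an illustration, not tested on the real data: the names add_neighbour and spec are hypothetical, the toy data is invented, and it assumes the time columns are called hours and ten_mins_interval in both data frames. The lookup columns are renamed to df2's key names before merging so that df2 keeps its own column names, and a temporary row id restores the original row order, since merge() sorts its result.

```r
# Hypothetical helper: add df's order_cnt to df2 as column `new_col`,
# matching the lat/lon points `df_pts` in df against `df2_pts` in df2.
add_neighbour <- function(df2, df, df_pts, df2_pts, new_col,
                          time_cols = c("hours", "ten_mins_interval")) {
  # c("first", "sixth") -> "first_lat", "first_lon", "sixth_lat", "sixth_lon"
  coord <- function(p) as.vector(rbind(paste0(p, "_lat"), paste0(p, "_lon")))
  df_keys  <- c(coord(df_pts),  time_cols)
  df2_keys <- c(coord(df2_pts), time_cols)

  # Lookup table, renamed so its key columns carry df2's key names.
  lookup <- df[, c(df_keys, "order_cnt")]
  names(lookup) <- c(df2_keys, new_col)
  # Keep the first row per key, as the loop does when several rows match.
  lookup <- lookup[!duplicated(lookup[df2_keys]), ]

  df2$.row <- seq_len(nrow(df2))            # remember the original row order
  out <- merge(df2, lookup, by = df2_keys, all.x = TRUE)
  out <- out[order(out$.row), c(setdiff(names(df2), ".row"), new_col)]
  rownames(out) <- NULL
  out
}

# Column pairings copied from the six blocks of the loop in the question.
spec <- list(
  a = list(src = c("first", "sixth"),  dst = c("third", "fourth")),
  b = list(src = c("fifth", "sixth"),  dst = c("third", "second")),
  c = list(src = c("fifth", "fourth"), dst = c("first", "second")),
  d = list(src = c("third", "fourth"), dst = c("first", "sixth")),
  e = list(src = c("third", "second"), dst = c("fifth", "sixth")),
  f = list(src = c("first", "second"), dst = c("fifth", "fourth"))
)

# Toy data (invented): lat and lon of each point share one made-up number.
pts  <- c("first", "second", "third", "fourth", "fifth", "sixth")
vals <- rbind(c(1, 2, 3, 4, 5, 6),
              c(3, 7, 8, 9, 10, 4))
colnames(vals) <- pts
df <- data.frame(grid_id = c("g1", "g2"), hours = 8, ten_mins_interval = 1,
                 order_cnt = c(10, 20))
for (p in pts) {
  df[[paste0(p, "_lat")]] <- vals[, p]
  df[[paste0(p, "_lon")]] <- vals[, p]
}
df2 <- df

for (nm in names(spec)) {
  df2 <- add_neighbour(df2, df, spec[[nm]]$src, spec[[nm]]$dst, nm)
}
```

In this toy example, g1's third/fourth points coincide with g2's first/sixth points, so only columns a and d receive values; the others stay NA.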

How can loops be optimized and sped up when working with large datasets in R?
