How can I optimize and speed up loops in R when working with large datasets?

Asked: 2017-06-07 09:08:13

Tags: r loops

I am currently doing a data transformation. The data is not very large, about 190k rows.

I wrote a for loop like this:

for (i in 1:nrow(df2)){
#a
record.a <- df[which(df$first_lat==df2[i,"third_lat"] 
            & df$first_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"fourth_lat"] 
            & df$sixth_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,18] <- ifelse(nrow(record.a) != 0,record.a$order_cnt,NA)

#b
record.b <- df[which(df$fifth_lat==df2[i,"third_lat"] 
            & df$fifth_lon==df2[i,"third_lon"] 
            & df$sixth_lat==df2[i,"second_lat"] 
            & df$sixth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,19] <- ifelse(nrow(record.b) != 0,record.b$order_cnt,NA)

#c
record.c <- df[which(df$fifth_lat==df2[i,"first_lat"] 
            & df$fifth_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"second_lat"] 
            & df$fourth_lon==df2[i,"second_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,20] <- ifelse(nrow(record.c) != 0,record.c$order_cnt,NA)

#d
record.d <- df[which(df$third_lat==df2[i,"first_lat"] 
            & df$third_lon==df2[i,"first_lon"] 
            & df$fourth_lat==df2[i,"sixth_lat"] 
            & df$fourth_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,21] <- ifelse(nrow(record.d) != 0,record.d$order_cnt,NA)

#e
record.e <- df[which(df$third_lat==df2[i,"fifth_lat"] 
            & df$third_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"sixth_lat"] 
            & df$second_lon==df2[i,"sixth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,22] <- ifelse(nrow(record.e) != 0,record.e$order_cnt,NA)

#f
record.f <- df[which(df$first_lat==df2[i,"fifth_lat"] 
            & df$first_lon==df2[i,"fifth_lon"] 
            & df$second_lat==df2[i,"fourth_lat"] 
            & df$second_lon==df2[i,"fourth_lon"] 
            & df[,4]==df2[i,4] 
            & df[,3]==df2[i,5]),]
df2[i,23] <- ifelse(nrow(record.f) != 0,record.f$order_cnt,NA)
}

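So basically, I need to fill 6 columns of df2 from df, each using its own matching criteria. In the for loop, nrow(df2) is about 190k, and it runs extremely slowly. I checked the result with View(df2) and the values are correct. Is there any way to make it faster? I may need to apply the same transformation to larger datasets in the future.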

[Screenshot: df]

[Screenshot: df2]

The data relates to a grid on a map. df2 is basically a subset of df with 6 extra columns added. df and df2 carry the same lon and lat information.

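Each grid_id represents a hexagonal area on the map. Each hexagon is connected to six other hexagons, each through two pairs of lon and lat. What I want to do is find a specific value from the six surrounding hexagons (in df) and use it to fill the columns (a, b, c, d, e, f) in df2. I also need two more conditions, hours and ten_mins_interval: (df[,4] == df2[i,4] & df[,3] == df2[i,5])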

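So I think the logic is:

  1. For each combination of grid_id, hours, ten_mins_interval in df2 (1 row)
  2. Find the 6 neighbouring grid_ids (6 rows) in df for the same hours and ten_mins_interval
  3. Fill order_cnt from those 6 rows into columns a, b, c, d, e, f of df2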
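
The three steps above can be sketched without a loop as a keyed lookup. This is only a minimal illustration on made-up toy data: the column names hours and ten_mins_interval, the helper make_key, and all values are assumptions, and only column a is shown. It builds one string key per row from the columns that must agree, then uses match() to find the first matching row of df, which mirrors what the loop stores when several rows match.

```r
# Toy data: every value here is invented for illustration.
df <- data.frame(
  first_lat = c(1, 1, 2),
  first_lon = c(10, 10, 20),
  sixth_lat = c(3, 3, 4),
  sixth_lon = c(30, 30, 40),
  hours = c(8, 9, 8),
  ten_mins_interval = c(1, 1, 2),
  order_cnt = c(5, 7, 9)
)
df2 <- data.frame(
  third_lat = c(1, 2, 9),
  third_lon = c(10, 20, 90),
  fourth_lat = c(3, 4, 9),
  fourth_lon = c(30, 40, 90),
  hours = c(9, 8, 8),
  ten_mins_interval = c(1, 2, 1)
)

# Hypothetical helper: paste the chosen columns into one key string per row.
make_key <- function(d, cols) do.call(paste, c(unname(d[cols]), sep = "|"))

key_df <- make_key(df, c("first_lat", "first_lon", "sixth_lat", "sixth_lon",
                         "hours", "ten_mins_interval"))
key_df2 <- make_key(df2, c("third_lat", "third_lon", "fourth_lat", "fourth_lon",
                           "hours", "ten_mins_interval"))

# match() gives the position of the first matching df row, or NA if none.
df2$a <- df$order_cnt[match(key_df2, key_df)]
print(df2$a)
```

Columns b to f would follow the same pattern with their own column pairs. One caveat: pasting coordinates into strings compares their printed form (up to 15 significant digits), which may differ from == on floating-point values in edge cases.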

1 answer:

Answer 0: (score: 0)

If you start from your current df2[,1:17], you can add column 18 of df2 with the merge command:

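# x is df, so by.x uses df's 4th and 3rd column names;
# y is df2, so by.y uses df2's 4th and 5th column names
# (the loop compares df[,4] with df2[i,4] and df[,3] with df2[i,5]).
df2 <- merge(df[,c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name","order_cnt")],
      df2,
      by.x=c("first_lat","first_lon","sixth_lat","sixth_lon","col4name","col3name"),
      by.y=c("third_lat","third_lon","fourth_lat","fourth_lon","col4name","col5name"),
      all.y=TRUE)  # keep every df2 row; rows without a match get NA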

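You need to replace col4name with the name of the fourth column, and so on; I cannot tell from the screenshots what those names might be. Five more versions of this command can easily be produced to add the other five columns. Because the operation works on whole vectors, it will probably be faster than the loop. It is untested, since the data was not provided in a suitable format.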
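
To make the merge approach concrete, here is a small sketch on made-up toy data. The column names hours and ten_mins_interval, the grid_id values, and all numbers are assumptions chosen for illustration, not taken from the real data. Unlike the command above, it puts df2 first and uses all.x=TRUE, and it renames order_cnt to a beforehand so that repeating the merge for b to f would not produce clashing column names.

```r
# Toy data: every value here is invented for illustration.
df <- data.frame(
  first_lat = c(1, 2), first_lon = c(10, 20),
  sixth_lat = c(3, 4), sixth_lon = c(30, 40),
  hours = c(9, 8), ten_mins_interval = c(1, 2),
  order_cnt = c(7, 9)
)
df2 <- data.frame(
  grid_id = c("g1", "g2", "g3"),
  third_lat = c(1, 2, 9), third_lon = c(10, 20, 90),
  fourth_lat = c(3, 4, 9), fourth_lon = c(30, 40, 90),
  hours = c(9, 8, 8), ten_mins_interval = c(1, 2, 1)
)

# Keep only the key columns and the value, and name the value after
# the target column.
lookup <- df[, c("first_lat", "first_lon", "sixth_lat", "sixth_lon",
                 "hours", "ten_mins_interval", "order_cnt")]
names(lookup)[names(lookup) == "order_cnt"] <- "a"

merged <- merge(
  df2, lookup,
  by.x = c("third_lat", "third_lon", "fourth_lat", "fourth_lon",
           "hours", "ten_mins_interval"),
  by.y = c("first_lat", "first_lon", "sixth_lat", "sixth_lon",
           "hours", "ten_mins_interval"),
  all.x = TRUE  # keep df2 rows with no matching neighbour; they get NA
)

# merge() does not preserve the original row order, so restore it explicitly.
merged <- merged[order(merged$grid_id), ]
print(merged[, c("grid_id", "a")])
```

Two things are worth checking on real data: merge() returns one output row per matching pair, so duplicate keys in df would add rows to the result, and the row order of the result should not be assumed to match df2.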