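Memory-efficient way to process the rows of an R data frame without a loop

Asked: 2018-06-27 18:45:08

Tags: r for-loop dataframe

My data frame data1 (more than 1.5 million rows) is structured like this: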

data1 <- data.frame(NEW_UPC=c(11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005991,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005992,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005993,11820005994,11820005994,11820005994,11820005994,11820005994,11820005994,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995,11820005995),
                IRI_KEY=c(1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106,1078107,1078107,1078107,1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106,1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106,1073521,1073521,1073525,1073525,1078106,1078106,1073521,1073521,1073521,1073525,1073525,1073525,1078106,1078106,1078106),
                WEEK = c(1229,1230,1232,1218,1224,1229,1282,1285,1287,1229,1230,1232,1229,1230,1232,1218,1224,1229,1282,1285,1287,1229,1230,1232,1217,1221,1227,1270,1272,1273,1273,1274,1270,1272,1217,1221,1229,1230,1232,1218,1224,1229,1282,1285,1287),
                END=c(1232,1232,1232,1229,1229,1229,1287,1287,1287,1232,1232,1232,1232,1232,1232,1229,1229,1229,1287,1287,1287,1232,1232,1232,1227,1227,1227,1273,1273,1273,1274,1274,1272,1272,1221,1221,1232,1232,1232,1229,1229,1229,1287,1287,1287))

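I need to add a column Exit.time computed from the values in columns WEEK and END, with a cutoff value of 1287. Exit.time should be 0 or 1 according to the following logic:

If WEEK = 1287, then Exit.time = 0.

If WEEK is not equal to 1287 but WEEK = END, then Exit.time = 1; otherwise Exit.time = 0.

To do this, I tried the following for loop, which does what is needed on the dummy dataset above.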

# Row-by-row version: correct on the sample data, but very slow on 1.5 million rows
for (i in seq_len(nrow(data1))) {
  if (data1$WEEK[i] == 1287) {
    data1$Exit.time[i] <- 0
  } else if (data1$WEEK[i] == data1$END[i]) {
    data1$Exit.time[i] <- 1
  } else {
    data1$Exit.time[i] <- 0
  }
}

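The problem is that when I run this loop on the actual dataset, I get no output even after an hour. Given the size of the dataset, I assume the loop is simply inefficient. Is there another way to do what I want? I would prefer to keep the row order in data1, because some merge operations need to be done later.

3 Answers:

Answer 0: (score: 4)

Since Exit.time needs to be 1 when (WEEK == END) & WEEK != 1287 and 0 otherwise, you can apply as.numeric to the result of (WEEK == END) & WEEK != 1287, which converts TRUE to 1 and FALSE to 0.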

data1$Exit.time <- with(data1, as.numeric(WEEK != 1287 & WEEK == END))
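As a quick sanity check, the sketch below compares the vectorized expression with the row-by-row logic from the question on a small made-up sample. The data frame `toy` and its values are illustrative assumptions, not part of the original data. Part of the cost of the original loop is probably that each assignment into the data frame can copy the column, while the vectorized form processes whole columns in one pass.

```r
# Small illustrative sample (hypothetical values, same column names as data1)
toy <- data.frame(
  WEEK = c(1229, 1232, 1285, 1287, 1229),
  END  = c(1232, 1232, 1287, 1287, 1229)
)

# Row-by-row logic from the question, written into a preallocated vector
loop_result <- numeric(nrow(toy))
for (i in seq_len(nrow(toy))) {
  if (toy$WEEK[i] == 1287) {
    loop_result[i] <- 0
  } else if (toy$WEEK[i] == toy$END[i]) {
    loop_result[i] <- 1
  } else {
    loop_result[i] <- 0
  }
}

# Vectorized version: a single comparison over whole columns
vec_result <- with(toy, as.numeric(WEEK != 1287 & WEEK == END))

# TRUE when both approaches give the same values in the same row order
identical(loop_result, vec_result)
```

Because the result is computed position by position, the row order of the data frame is left untouched.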

Answer 1: (score: 3)

There are several ways to code this. They differ mainly in syntax and do essentially the same thing.

Base R:

data1$Exit.time <- (data1$WEEK != 1287 & data1$WEEK == data1$END)*1

This involves typing data1 a lot, so there is a shortcut:

data1 <- within(data1, {
  Exit.time <- (WEEK != 1287 & WEEK == END)*1
})

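Tidyverse: The tidyverse is a suite of packages well suited to working with data. We are using the dplyr package, which is part of the tidyverse, so you can either load the whole tidyverse or just dplyr.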

library(tidyverse)
data1 <- data1 %>%
   mutate(
     Exit.time = (WEEK != 1287 & WEEK == END)*1
   )

(I convert from TRUE/FALSE to 0/1 by multiplying by 1, which is less typing.)
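Since the question asks about memory, one detail may be worth noting: multiplying a logical vector by 1 produces a double vector, whereas as.integer produces an integer vector that takes roughly half the space. The sketch below illustrates this on made-up vectors; `n`, `WEEK`, and `END` here are illustrative assumptions, not the original data.

```r
# Hypothetical columns, long enough to make the storage difference visible
n <- 1e5
WEEK <- rep(c(1229, 1232, 1287), length.out = n)
END  <- rep(c(1232, 1232, 1287), length.out = n)

flag <- WEEK != 1287 & WEEK == END

as_double  <- flag * 1          # multiplying by the double 1 yields a double vector
as_integer <- as.integer(flag)  # integer vector holding the same 0/1 values

typeof(as_double)
typeof(as_integer)

# Compare the storage used by the two representations
as.numeric(object.size(as_integer)) < as.numeric(object.size(as_double))
```

Whether the saving matters depends on how many such columns are created; for a single 0/1 column on 1.5 million rows either form should be fine.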

Answer 2: (score: 0)

Using data.table:

library(data.table)
setDT(data1)[, Exit.time := ifelse(WEEK == 1287, 0, ifelse(WEEK != 1287 & WEEK == END, 1, 0))]
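The nested ifelse can likely be collapsed into a single logical expression, because the outer test WEEK == 1287 is already covered by WEEK != 1287 in the inner condition. A minimal sketch, assuming the data.table package is installed; the data frame `toy` and its values are hypothetical stand-ins for data1.

```r
library(data.table)

# Hypothetical sample with the same column names as data1
toy <- data.frame(
  WEEK = c(1229, 1232, 1287, 1229),
  END  = c(1232, 1232, 1287, 1229)
)

setDT(toy)  # converts the data frame to a data.table by reference

# := adds the column by reference, so the table is not copied
toy[, Exit.time := as.integer(WEEK != 1287 & WEEK == END)]

toy$Exit.time
```

Neither setDT nor := reorders rows, so the original row order is preserved for later merges.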