Question

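I have 35,000 rows. If the preventive_chem value is "Y", prev_efficacy should be 1, 3, 5 for the two preceding days and the "Y" day itself, then count down from 10 to 1 over the following 10 days. The example output is in the img file; a sample is shown below.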

Prev_Chem Date  prev_effi
0   7/3/2016    0   
0   7/4/2016    0   
0   7/5/2016    1   
0   7/6/2016    3   
Y   7/7/2016    5   
0   7/8/2016    10  
0   7/9/2016    9   
0   7/10/2016   8   
0   7/11/2016   7   
0   7/12/2016   6   
0   7/13/2016   5   
0   7/14/2016   4   
0   7/15/2016   3   
0   7/16/2016   2   
0   7/17/2016   1
0   7/18/2016   0
0   7/19/2016   0

If the preventive_chem value is 0, the prev_efficacy value is 0.

When I try this code,

df$PreventEffic <- rep(0,nrow(df))
for(i in 1:nrow(df))
   {
     if(df$Preventive_Chem1[i] == "Y") 
       {   
       df$PreventEffic[i] <- 5
       df$PreventEffic[i-2] <- 1
       df$PreventEffic[i-1] <- 3
       df$PreventEffic[i+1] <- 10
       df$PreventEffic[i+2] <- 9
       df$PreventEffic[i+3] <- 8
       df$PreventEffic[i+4] <- 7
       df$PreventEffic[i+5] <- 6
       df$PreventEffic[i+6] <- 5
       df$PreventEffic[i+7] <- 4
       df$PreventEffic[i+8] <- 3
       df$PreventEffic[i+9] <- 2
       df$PreventEffic[i+10] <- 1
       }
     }

the code takes a very long time to run and returns the value 0 for rows up to 1016321. Is there an efficient way to handle this without using a "for loop"?

Answer 1

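Assuming your data frame structure is consistent, i.e. every "Y" has at least 2 days before it and 10 days after it, you don't need a for loop. Just find the indices of "Y" and use them to assign the value for each +/- day offset: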

indx <- which(df$Prev_Chem == "Y")
df$PreventEffic <- rep(0,nrow(df))
df$PreventEffic[indx] <- 5
df$PreventEffic[indx-2] <- 1
df$PreventEffic[indx-1] <- 3
df$PreventEffic[indx+1] <- 10
df$PreventEffic[indx+2] <- 9
df$PreventEffic[indx+3] <- 8
df$PreventEffic[indx+4] <- 7
df$PreventEffic[indx+5] <- 6
df$PreventEffic[indx+6] <- 5
df$PreventEffic[indx+7] <- 4
df$PreventEffic[indx+8] <- 3
df$PreventEffic[indx+9] <- 2
df$PreventEffic[indx+10] <- 1
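
If the consistency assumption does not hold (a "Y" within 2 rows of the top or 10 rows of the bottom), the assignments above can misbehave: a negative index such as indx-2 = -1 means "everything except row 1" in R, and an index past nrow(df) makes the data frame assignment fail. The following is a minimal sketch of one way to guard against that, not the answerer's code; the toy data, and the names offsets, values and target, are assumptions made for illustration. Note that overlapping windows are resolved by offset order here, which differs slightly from the assignment order above.

```r
# Hypothetical toy data: one "Y" at the very top, one near the bottom
df <- data.frame(Prev_Chem = rep("0", 20), stringsAsFactors = FALSE)
df$Prev_Chem[c(1, 15)] <- "Y"

offsets <- -2:10              # day offsets relative to each "Y" row
values  <- c(1, 3, 5, 10:1)   # efficacy value assigned at each offset

indx <- which(df$Prev_Chem == "Y")
df$PreventEffic <- rep(0, nrow(df))

# 13 vectorized assignments, one per offset, regardless of the row count
for (k in seq_along(offsets)) {
  target <- indx + offsets[k]
  # drop positions that fall outside the data frame
  target <- target[target >= 1 & target <= nrow(df)]
  df$PreventEffic[target] <- values[k]
}
```

The loop runs over the 13 offsets rather than over the rows, so its cost does not grow with the number of loop iterations per row.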

Answer 2

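There are two main inefficiencies in the code:

Pre-compute the positions of the interesting rows. Instead of looping row by row, do a vectorized comparison.
Since you are assigning a fixed numeric vector to the region of positions adjacent to each matching row, you can do the assignment as a vector, too.

A first (but naive) implementation might be: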

n <- 32
df <- data.frame(x = rep(0, n), y = 0)
df$x[c(5,20)] <- 1
str(df)
# 'data.frame': 32 obs. of  2 variables:
#  $ x: num  0 0 0 0 1 0 0 0 0 0 ...
#  $ y: num  0 0 0 0 0 0 0 0 0 0 ...

for (i in which(df$x == 1))
  df$y[i + -2:10] <- c(1,3,5,10:1)
df
#    x  y
# 1  0  0
# 2  0  0
# 3  0  1
# 4  0  3
# 5  1  5
# 6  0 10
# 7  0  9
# 8  0  8
# 9  0  7
# 10 0  6
# 11 0  5
# 12 0  4
# 13 0  3
# 14 0  2
# 15 0  1
# 16 0  0
# 17 0  0
# 18 0  1
# 19 0  3
# 20 1  5
# 21 0 10
# 22 0  9
# 23 0  8
# 24 0  7
# 25 0  6
# 26 0  5
# 27 0  4
# 28 0  3
# 29 0  2
# 30 0  1
# 31 0  0
# 32 0  0

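But one should quickly wonder what happens when a matching row is fewer than 10 rows from the bottom of the data.frame. That is, you may see an error similar to: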

# Error in `$<-.data.frame`(`*tmp*`, "y", value = c(0, 0, 1, 3, 5, 10, 9,  : 
#   replacement has 30 rows, data has 28
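
As a small sketch of how that error arises, the snippet below builds a hypothetical 28-row data.frame with a match at row 20, so the window i + -2:10 reaches row 30. The row counts and the result variable are assumptions chosen to mirror the message above; tryCatch is used only so the script keeps running after the failure.

```r
# Hypothetical 28-row example with a match 8 rows from the bottom
df <- data.frame(x = rep(0, 28), y = 0)
df$x[20] <- 1
i <- which(df$x == 1)

# i + -2:10 spans rows 18..30, two rows past the end of the data.frame
result <- tryCatch({
  df$y[i + -2:10] <- c(1, 3, 5, 10:1)
  "no error"
}, error = function(e) conditionMessage(e))

print(result)
```

Because the assignment fails as a whole, df$y is left unchanged.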

Then you can try this (please forgive the atrocious variable naming):

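dfn <- nrow(df)  # total number of rows, used to truncate near the bottom
for (i in which(df$x == 1)) {
  # keep only as many forward offsets as there are rows left after row i
  j <- c(-2:0, head(1:10, n = dfn - i))
  k <- c(1,3,5, head(10:1, n = dfn - i))
  df$y[i + j] <- k
}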

head(..., n=dfn-i) ensures that we never have more replacement data than there are existing rows to modify.
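
Since the question asks for a solution without a for loop, here is a loop-free sketch of the same idea using outer() to build every target position at once. This is one possible approach under stated assumptions rather than part of the original answer: the names offsets, values, hits, pos, val and keep are hypothetical, and the toy data places a match near the bottom to exercise the truncation.

```r
n <- 32
df <- data.frame(x = rep(0, n), y = 0)
df$x[c(5, 30)] <- 1           # second match is only 2 rows from the bottom

offsets <- -2:10              # positions relative to each matching row
values  <- c(1, 3, 5, 10:1)   # value assigned at each offset

hits <- which(df$x == 1)

# 13 x length(hits) matrix: column h holds the target rows for hits[h]
pos <- outer(offsets, hits, `+`)
# matching values in the same column-major order as pos
val <- rep(values, times = length(hits))

# discard targets that fall outside the data.frame
keep <- pos >= 1 & pos <= nrow(df)
df$y[pos[keep]] <- val[keep]
```

When windows overlap, a later match overwrites an earlier one, which is the same outcome as the loop version.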

R for loop, for a new variable
