Question

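I'm trying to process data in R from two databases that have the same number of rows and columns, with the same row and column names. One database (database1) contains '1' and '--' to mark which cells are worth looking at. The other database (database2) is simply full of data.

I'm trying to replace all of the 'worthless data' in database2 (the cells marked with '--' in database1) with '--'.

My code works fine, but it is really slow. Granted, each spreadsheet has 1900 rows and ~8000 columns, but the code takes a little over 4 hours to run, which is suboptimal.

How can I make this code faster? Anything helps! Thanks!

Here's the code (pardon the variable names :P):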

for (n in 1:nrow(poopy)){
 list <- 0
 gooddates <- colnames(additions[which(additions[n,] == ' 1 ' | additions[n,] == '1')]) #some cells have a '1' and others a ' 1 ', so this accounts for both.
 for (j in 1:length(gooddates)){
   nextdateindex <- which(gooddates[j] == colnames(additions))+1  #database1 is by month. database2 is by day, so I'm taking the intervals of gooddates.
   if (is.na(colnames(additions)[nextdateindex])){
     nextdateindex <- '6.26.2014'
     couple <- cbind(gooddates[j], nextdateindex) #start and end intervals of gooddates
     list <- rbind(list, couple)
   }
   else{
     couple <- cbind(gooddates[j], colnames(additions)[nextdateindex])
     list <- rbind(list, couple)
   }
 }
 list <- list[-1,]

 test <- poopy

 if (is.null(nrow(list))){  ##some lists will only have one interval. this changes the indexing for some reason.
   test <- test[n,-which(colnames(test) == list[1]):-(which(colnames(test) == list[2])-1)]
 }
 else{
   for (i in 1:nrow(list)){
     test <- test[n,-which(colnames(test) == list[i,1]):-(which(colnames(test) == list[i,2])-1)]
   }
 }

 poopy[n,which((test == "--") == FALSE)[-1]] <- '--'

}

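Edit: database1 is monthly and database2 is daily, so the 1s and --s cannot be matched one-to-one from database1 to database2. I'm assuming that a 1 in database1 stays a 1 for the whole month, which is why I build a range in the 'couple' variable: it starts at the database1 date (the first column name) and ends the day before the next data point in database1, which is what nextdateindex locates. Hope this clarifies it!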

Very close, Roland. Thanks for trying!

Answer 1

It's hard to say without input data, but possibly something like this:

#some artificial data
set.seed(42)
dat1 <- as.data.frame(matrix(rnorm(20), 5))
dat2 <- as.data.frame(matrix(sample(c(1, "--"),20, TRUE), 5))

#a one-liner
dat1[dat2=="1"] <- NA
dat1
#          V1          V2         V3         V4
# 1        NA -0.10612452         NA  0.6359504
# 2        NA  1.51152200         NA -0.2842529
# 3        NA -0.09465904         NA         NA
# 4 0.6328626  2.01842371 -0.2787888         NA
# 5        NA -0.06271410         NA  1.3201133

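Note that I use NA in the result instead of "--", because R has many tools for handling NA values, and missing values are what these seem to be in your data.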
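
As a small illustration of the same idea applied to the question's markers, here is a minimal sketch that keeps only the cells flagged '1' and sets everything else to NA. The names `flags`, `values`, `keep` and `masked` and the toy data are made up for this example; it also assumes both tables have identical dimensions and row order, which the monthly/daily case in the question's edit does not satisfy as-is.

```r
# Hypothetical toy data: `flags` plays the role of database1, `values` of database2.
flags  <- data.frame(a = c("1", "--", " 1 "), b = c("--", "1", "--"),
                     stringsAsFactors = FALSE)
values <- data.frame(a = c(10, 20, 30), b = c(1.5, 2.5, 3.5))

# Trim whitespace so that " 1 " and "1" compare equal, then build a logical mask.
keep <- trimws(as.matrix(flags)) == "1"

# One vectorised assignment instead of a loop over rows.
masked <- values
masked[!keep] <- NA

# NA-aware tools then work directly on the result.
colMeans(masked, na.rm = TRUE)
sum(is.na(masked))
```

Because the mask is computed once for the whole table, no per-row loop and no repeated rbind calls are needed.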

Answer 2

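I ended up creating a new, empty table called additions2 that has the dates and rownames of database2. Then I picked out the columns that actually hold data in database1 and copied each valid column forward up to the next valid one, like so: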

additions2 <- additions2[order(additions2$Security.Name),]

valid <- which(colnames(additions2) %in% intersect(colnames(additions2), colnames(additions)))

additions2[,valid] <- additions
valid <- valid[-1]

additions3 <- additions2
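# Fill each block of daily columns, from one valid column up to the column
# just before the next valid one, with the matching column of additions.
for (i in seq_len(length(valid) - 1)){
  additions2[,valid[i]:(valid[i+1]-1)] <- additions[1+i]
}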

additions22 <- additions2

additions22[,tail(valid,1):ncol(additions22)] <- additions[ncol(additions)]
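
The fill-forward step above can probably also be written without a loop. The following is only a sketch under assumptions: the tables `monthly` and `daily` are invented stand-ins for database1 and database2, their column names are assumed to be dates in the month.day.year style seen in the question ('6.26.2014'), the monthly columns are assumed to be in increasing date order, and both tables are assumed to have their rows in the same order. It uses base R's `findInterval` to map every daily column to the most recent monthly column.

```r
# Hypothetical toy inputs: `monthly` stands in for database1, `daily` for database2.
monthly <- data.frame(`1.1.2014` = c("1", "--"), `2.1.2014` = c("--", " 1 "),
                      check.names = FALSE, stringsAsFactors = FALSE)
daily <- data.frame(`1.1.2014` = c(1, 2), `1.15.2014` = c(3, 4),
                    `2.1.2014` = c(5, 6), `2.20.2014` = c(7, 8),
                    check.names = FALSE)

# Parse the column names as dates (assumed month.day.year format).
month_dates <- as.Date(colnames(monthly), format = "%m.%d.%Y")
day_dates   <- as.Date(colnames(daily),   format = "%m.%d.%Y")

# For each daily column, the index of the latest monthly column on or before it
# (0 means the day falls before the first monthly column).
idx <- findInterval(as.numeric(day_dates), as.numeric(month_dates))

# Expand the monthly flags to daily resolution with a single indexing step.
flag_mat <- trimws(as.matrix(monthly))
keep <- matrix(FALSE, nrow(daily), ncol(daily))
keep[, idx > 0] <- flag_mat[, idx[idx > 0], drop = FALSE] == "1"

# Blank out every daily cell that is not covered by a "1".
daily[!keep] <- NA
daily
```

In this sketch the last monthly flag covers every later daily column, so no hard-coded end date is needed; whether that matches the intended cut-off in the real data is an assumption.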

Quickly processing many rows in R
