Question

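I have a column in a CSV file that looks something like c("","1","1 1e-3") (i.e. space-separated values). I'm trying to go through all the entries, taking the sum() of the values where there is at least one value and returning NA otherwise.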

My code currently does something like this:

x <- c("","1","1 2 3")
x2 <- as.numeric(rep(NA,length(x)))
for (i in 1:length(x)) {
  si <- scan(text=x[[i]],quiet=TRUE)
  if (length(si) > 0)
    x2[[i]] <- sum(si)
}

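I'm struggling to make this fast; x is actually a set of columns in a CSV file with a few hundred thousand rows, and I think it should be possible to do this within R.

(These are thinned samples from the back end of a reversible-jump MCMC algorithm, so multiple values are combined into one field as the dimension changes through the file, whereas I want usable columns.)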

Answer 1

Based on @Chase's idea, but handling NA and avoiding a named helper function:

unlist(lapply(strsplit(x, " "),
              function(v)
                if (length(v) > 0)
                  sum(as.numeric(v))
                else
                  NA))
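
The same idea can also be written with vapply(), which checks that every element yields exactly one numeric value. This is only a sketch, assuming the x from the question; it relies on strsplit() returning character(0) for an empty string, which is what makes the length(v) > 0 test work:

```r
x <- c("", "1", "1 2 3")

# split each entry on single spaces; "" becomes character(0)
parts <- strsplit(x, " ", fixed = TRUE)

# sum each piece, returning NA for entries with no values
res <- vapply(parts,
              function(v) if (length(v) > 0) sum(as.numeric(v)) else NA_real_,
              numeric(1))
res
```

Using NA_real_ rather than a bare NA keeps the return type numeric, which vapply() requires here.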

Answer 2

This seems to perform faster and might work for you.

#define a helper function
f <- function(x) sum(as.numeric(x))
unlist(lapply(strsplit(x, " "), f))
#-----
[1] 0 1 6

This returns zero instead of NA for the empty string, but maybe that's not a deal-breaker for you?
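
If the zeros do matter, one possible fix is to overwrite them afterwards in a single vectorised step. A minimal sketch, assuming that only truly empty strings should become NA (nzchar() is base R; res is just an illustrative name):

```r
x <- c("", "1", "1 2 3")
f <- function(v) sum(as.numeric(v))

res <- unlist(lapply(strsplit(x, " "), f))

# entries that were empty strings summed to 0; mark them as missing instead
res[!nzchar(x)] <- NA
res
```

This keeps the helper unchanged and distinguishes an empty entry from an entry whose values genuinely sum to zero.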

Let's see how it scales to a larger problem:

#set up variables
x3 <- rep(x, 1e5)
x4 <- as.numeric(rep(NA,length(x3)))
#initial approach
system.time(for (i in 1:length(x3)) {
  si <- scan(text=x3[[i]],quiet=TRUE)
  if (length(si) > 0)
    x4[[i]] <- sum(si)
})
#-----
   user  system elapsed 
   30.5     0.0    30.5 

#New approach:
system.time(unlist(lapply((strsplit(x3, " ")), f)))
#-----
   user  system elapsed 
   0.82    0.01    0.84
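
Before reading too much into the timings, it may be worth confirming on a small sample that the two approaches agree. A rough sketch, assuming the same x as in the question; x_small, loop_res and vec_res are names made up for this check, and the strsplit version here maps empty entries to NA so that it matches the loop:

```r
x <- c("", "1", "1 2 3")
x_small <- rep(x, 100)  # small sample so the scan() loop finishes quickly

# loop approach from the question
loop_res <- rep(NA_real_, length(x_small))
for (i in seq_along(x_small)) {
  si <- scan(text = x_small[[i]], quiet = TRUE)
  if (length(si) > 0)
    loop_res[[i]] <- sum(si)
}

# strsplit approach, with empty entries mapped to NA
vec_res <- vapply(strsplit(x_small, " "),
                  function(v) if (length(v) > 0) sum(as.numeric(v)) else NA_real_,
                  numeric(1))

all.equal(loop_res, vec_res)
```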

A fast way to parse lists of numbers
