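Extracting data from 332 .csv files and returning the number of complete cases in each file

Time: 2014-12-13 20:38:53

Tags: r csv

I am writing an R function that reads a directory containing 332 .csv files and reports the number of completely observed cases in each data file. The function returns a data frame whose first column is the file ID and whose second column is the number of complete cases. For example: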

ID  OBS
1   233
2   149
etc.

This is the code I wrote:

complete <- function(directory, id = 1:332) {
    files_full <- list.files(directory, full.names = TRUE)
    nobs <- sum(complete.cases(files_full[id]))
    data <- data.frame(id, nobs)
    return(data)

}

The problem is that when I run the function, every value in the "nobs" column is 1.

5 answers:

Answer 0: (score: 4)

A slightly different approach:

complete <- function(directory, pattern = "csv$") {
    fnames <- list.files(directory, pattern = pattern, full.names = TRUE)
    # build one single-row data frame per file, then stack them;
    # this keeps both columns as plain vectors rather than lists
    do.call(rbind, lapply(fnames, function(fname) {
        data.frame(file = fname, complete = sum(complete.cases(read.csv(fname))))
    }))
}

If you want to keep id as a parameter:

complete <- function(directory, id = 1:332) {
    count_complete <- function(fname) sum(complete.cases(read.csv(fname)))
    fnames <- list.files(directory, full.names=TRUE)[id]
    data.frame(id = id, complete = unlist(lapply(fnames, count_complete)))
}

Answer 1: (score: 3)

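sum(complete.cases(files_full[i])) doesn't make much sense, and is probably where you went wrong: complete.cases is applied to the file name (a character string), not to the data stored in the file.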
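To make this concrete, here is a small sketch. The path and the toy data frame are made up for illustration, and the file does not need to exist. It suggests why the count is always 1: a file name is a one-element character vector with no missing value, whereas a data frame is checked row by row.

```r
# A file path is just a character vector of length 1 (hypothetical name)
path <- "specdata/001.csv"
path_result <- sum(complete.cases(path))   # one non-missing element

# A small stand-in for the contents of a data file (assumed data)
d <- data.frame(x = c(1, NA, 3, 4), y = c("a", "b", NA, "d"))
data_result <- sum(complete.cases(d))      # rows with no NA in any column

print(path_result)
print(data_result)
```

So the file has to be read with read.csv before complete.cases can say anything about its rows.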

I would do it like this:

1. Define a function that processes a single dataset:

read_and_summarise <- function(f, ...) {d <- read.csv(f, ...) ; sum(complete.cases(d))}

2. Apply this function to all the files:

lf <- list.files(directory, full.names = TRUE)
vapply(lf, read_and_summarise, 0L)

(untested)

Answer 2: (score: 3)
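A minimal end-to-end sketch of the two steps, under some assumptions: the temporary directory, the two toy files and the names demo_dir and result are invented here so the example can run on its own, and the "\\.csv$" pattern is an addition to skip any non-CSV files.

```r
# Step 1: count the complete cases in a single file
read_and_summarise <- function(f, ...) {
    d <- read.csv(f, ...)
    sum(complete.cases(d))
}

# Hypothetical demo directory with two small files
demo_dir <- tempfile("complete_cases_demo")
dir.create(demo_dir)
write.csv(data.frame(a = c(1, NA, 3), b = c(4, 5, 6)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(a = c(1, 2), b = c(NA, NA)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Step 2: apply the function to every file in the directory
lf <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
counts <- vapply(lf, read_and_summarise, 0L)

# Assemble the two-column result the question asks for
result <- data.frame(file = basename(lf), nobs = unname(counts))
print(result)
```

vapply checks that each call returns a single integer, which is what sum gives for a logical vector, so a malformed result fails early instead of silently producing a list.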

Let's look at what your code actually does:

complete <- function(directory, id = 1:332) {
    # list files
    files_full <- list.files(directory, full.names = TRUE)
    # create an empty placeholder, to grow sequentially. Known in some circles as R Inferno 
    # http://www.burns-stat.com/documents/books/the-r-inferno/
    dat <- data.frame()
    for (i in id) { # select filenames based on their position in the list 
                    # (prone to errors, because it depends on the order)
            dat <- rbind(dat, read.csv(files_full[i])) # read the data, and append it 
                                                       # to previous data.frame. Why??
            nobs <- sum(complete.cases(files_full[i])) # number of complete cases...
                                                       # in a character vector of length 1
            data <- data.frame(id, nobs)               # this gets overwritten every time
    }
    data
}

Here is what you probably meant to write:

complete <- function(directory, id = 1:332) {
    # list files
    files_full <- list.files(directory, full.names = TRUE)
    files_toread <- files_full[id] # filter out unwanted files (tip: ?grep is better)
    output <- data.frame(id = id, nobs = 0)
    for (i in seq_along(id)) { # loop over positions, so a subset such as id = 5:10 also works
            tmp <- read.csv(files_toread[i]) # read the data
            nobs <- sum(complete.cases(tmp)) # number of complete cases
            output[i, "nobs"] <- nobs
    }
    output
}
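As an aside on the "depends on the order" comment: list.files returns names in alphabetical order, so selecting files by position only matches the numeric id when the names are zero-padded. The sketch below uses made-up file names to illustrate this; method = "radix" is used only to get a locale-independent ordering for the demonstration.

```r
# Hypothetical file names without zero padding
unpadded <- c("1.csv", "2.csv", "10.csv")
sorted_unpadded <- sort(unpadded, method = "radix")  # C-locale ordering

# Zero-padded names built from the ids
ids <- c(1L, 2L, 10L)
padded <- sprintf("%03d.csv", ids)
sorted_padded <- sort(padded, method = "radix")

print(sorted_unpadded)  # "10.csv" sorts before "2.csv"
print(sorted_padded)    # order matches the numeric ids
```

If the names are not padded, building each path from the id (for example with sprintf and file.path) is likely safer than indexing into the listing.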

Answer 3: (score: 1)

Here is my solution, which seems easier to read:

complete <- function(directory,id=1:332){
    filenames <- sprintf("%03d.csv", id)
    filePaths <- paste(directory, filenames, sep="/")
    nFiles=length(id)
    output <- matrix(ncol=2, nrow=nFiles)
    for(i in 1:nFiles){
        output[i,]= c(id[i],sum(complete.cases(read.csv(filePaths[i]))))
    }
    output <- setNames(data.frame(output),c("id","nobs"))
    output
}

Hope this helps someone.

Answer 4: (score: 0)

I think this is simpler and easier to understand:

complete <- function(dir, id = 1:332) {
    dir <- list.files(dir, full.names = TRUE)
    count <- data.frame()

    for (i in id) {
        ok <- sum(complete.cases(read.csv(dir[i])))
        count <- rbind(count, ok)
    }
    count_table <- cbind(id, count)
    colnames(count_table) <- c("id", "nobs")
    count_table
}