Question

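I have multiple text files containing either 18 or 20 columns. I want to bind all the files together, but to do that I have to remove the first two columns (date and time) from the files that have 20 columns.

I am new to R and can't figure out how to get past the error "numbers of columns of arguments do not match". So I want to check whether the first two columns of a file are called date and time and, if so, delete those columns. This is the code I'm working with: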

file_list <- list.files()

for (file in file_list){
    if (!exists("dataset")){
        dataset <- read.table(file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
        if (colnames(dataset)[1] == "date" & colnames(dataset)[2] == "time"){
            dataset$date <- NULL
            dataset$time <- NULL
        }
    }

    if (exists("dataset")){
        temp_dataset <- read.table(file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
        dataset <- rbind(dataset, temp_dataset)
        rm(temp_dataset)
    }
}

Thanks!

Answer 1

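As @user5249203 commented, if you know before loading, based on the file name (or something else), that a file contains too many columns, then you can skip those columns programmatically. If not, read on.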
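A minimal sketch of skipping columns at read time, assuming tab-separated files with a header row as in the question: read only the header first, then pass "NULL" in colClasses for the columns to skip. The helper name read_skipping and the sample file are made up for illustration.

```r
# Hypothetical helper (not from the question): skip unwanted columns at read time.
read_skipping <- function(path, drop = c("date", "time")) {
  # Read the header plus one row, just to learn the column names.
  header <- names(read.table(path, header = TRUE, sep = "\t", nrows = 1))
  # "NULL" tells read.table to skip a column; NA lets it guess the type.
  classes <- ifelse(header %in% drop, "NULL", NA)
  read.table(path, header = TRUE, sep = "\t",
             stringsAsFactors = FALSE, colClasses = classes)
}

# Made-up sample file that carries the two extra columns.
tmp <- tempfile(fileext = ".txt")
writeLines(c("date\ttime\ta\tb", "2020-01-01\t00:00\t1\t2"), tmp)
skipped <- read_skipping(tmp)
names(skipped)
```

Files that lack the date and time columns pass through unchanged, because none of their header names match drop.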

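I'll assume you are reading the files in with something like:

# full.names = TRUE returns paths that can be read from any working directory
fnames <- list.files(path = "some/dir", pattern = "\\.csv$", full.names = TRUE)
# replace `read.csv` with whichever function you're using to read in the data
alldata <- sapply(fnames, read.csv, stringsAsFactors = FALSE, simplify = FALSE)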

Lacking any such files, I'll generate a fake alldata list:

set.seed(42)
fnames <- paste0("mtcars", 1:5)
alldata <- sapply(fnames, function(fn) {
  if (runif(1) < 0.7) mtcars[,-1] else mtcars
}, simplify = FALSE)
# should have 3 with 11 columns, 2 with 10 columns
sapply(alldata, ncol)
# mtcars1 mtcars2 mtcars3 mtcars4 mtcars5
#      11      11      10      11      10

Not surprisingly, we can't rbind them using base R:

do.call("rbind", alldata)
# Error in rbind(deparse.level, ...) :
#   numbers of columns of arguments do not match

dplyr

We can use dplyr::bind_rows, but it keeps the unwanted column, filling it with NA for the rows that come from the narrower tables:

library(dplyr)
str( bind_rows(alldata) )
# 'data.frame':	160 obs. of  11 variables:
#  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

What you don't see in this str summary is that some of the mpg values are NA:

table(is.na(bind_rows(alldata)$mpg))
# FALSE  TRUE
#    96    64

(Drop that column if you don't need it.)

Base R

(Assuming you choose not to use dplyr.) Starting from the actual alldata list:

numColumnsWanted <- 10 # you want this to be 18, I think
alldata2 <- lapply(alldata, function(dat) {
  # this grabs the *last* 'numColumnsWanted' columns
  if (ncol(dat) > numColumnsWanted) dat[, 1 + ncol(dat) - numColumnsWanted:1] else dat
})

Verify that the data.frames are all the same size. (You should probably also verify the column names.)
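One way to check the column names, sketched under the assumption that every trimmed data.frame should carry exactly the same names in the same order. The alldata2 list here is a small stand-in built from mtcars so the snippet runs on its own; the names reference and same_names are illustrative.

```r
# Stand-in for alldata2 (assumption: all frames have the same columns after trimming).
alldata2 <- list(a = mtcars[, -1], b = mtcars[, -1], c = mtcars[, -1])

# Compare every data.frame's column names against the first one.
reference <- names(alldata2[[1]])
same_names <- vapply(alldata2,
                     function(dat) identical(names(dat), reference),
                     logical(1))
same_names

# Stop early if any element disagrees with the reference.
stopifnot(all(same_names))
```

Because rbind on data.frames matches columns by name, a mismatch in names (not just in column count) would also produce an error, so checking up front gives a clearer failure.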

sapply(alldata2, ncol)
# mtcars1 mtcars2 mtcars3 mtcars4 mtcars5
#      10      10      10      10      10

Now you should be able to rbind them safely:

str( do.call("rbind", alldata2) )
# 'data.frame':	160 obs. of  10 variables:
#  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#  $ disp: num  160 160 108 258 360 ...
#  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#  $ qsec: num  16.5 17 18.6 19.4 17 ...
#  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

(Note that $mpg is not present in this solution.)

Answer 2

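Consider using grep() with invert = TRUE on the column names, inside lapply(), to remove date and time. The code below works regardless of where those two columns are located, and also when they are absent from the smaller files.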

dfList <- lapply(file_list, function(f) {
                    df <- read.table(f, header=TRUE, sep="\t", stringsAsFactors=FALSE)
                    df <- df[grep("(date|time)", names(df), invert = TRUE)]
                 })

finaldf <- do.call(rbind, dfList)

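Or, negate the match in the regex itself instead of using invert = TRUE. A bracket expression such as "[^(date|time)]" is a negated character class, not a negated alternation, so it would also drop a column named "id" or "item"; a Perl-style negative lookahead expresses the intent:

dfList <- lapply(file_list, function(f) {
                    df <- read.table(f, header=TRUE, sep="\t", stringsAsFactors=FALSE)
                    # keep names that do not contain "date" or "time" (requires perl = TRUE)
                    df <- df[grep("^(?!.*(date|time))", names(df), perl = TRUE)]
                 })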

finaldf <- do.call(rbind, dfList)
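Both grep() variants match "date" and "time" anywhere in a column name, so a column such as "update" would be removed too. If only the exact names should go, a plain %in% test avoids regular expressions altogether. The sketch below is self-contained: the two sample files, their column names, and demo_dir are made up for illustration.

```r
# Two made-up tab-separated files: one with date/time, one without.
demo_dir <- tempfile("cols")
dir.create(demo_dir)
writeLines(c("date\ttime\tupdate\tvalue", "2020-01-01\t00:00\t1\t2"),
           file.path(demo_dir, "wide.txt"))
writeLines(c("update\tvalue", "3\t4"),
           file.path(demo_dir, "narrow.txt"))
file_list <- list.files(demo_dir, full.names = TRUE)

dfList <- lapply(file_list, function(f) {
  df <- read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
  # Exact-name match: "update" is kept even though it contains "date".
  df[!names(df) %in% c("date", "time")]
})

finaldf <- do.call(rbind, dfList)
names(finaldf)
```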

Answer 3

Thanks for your suggestions!

One solution that worked for me was to replace

dataset<-rbind(dataset, temp_dataset)

with the following (rbind.fill comes from the plyr package):

dataset<-rbind.fill(dataset, temp_dataset)

Missing data are replaced by NA, and I can then easily remove the incomplete columns.
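A minimal sketch of that last step in base R, assuming "incomplete" means any column that contains at least one NA. The sample frames are made up, and the padding step only mimics the NA-filled result that a fill-style bind would produce, so the snippet runs without extra packages.

```r
# Made-up frames: one with date/time, one without.
wide   <- data.frame(date = "2020-01-01", time = "00:00", x = 1:2, y = 3:4)
narrow <- data.frame(x = 5:6, y = 7:8)

# Mimic a fill-style bind: pad the narrow frame with NA columns, then rbind.
narrow[setdiff(names(wide), names(narrow))] <- NA
combined <- rbind(wide, narrow[names(wide)])

# Keep only the columns that have no NA at all.
complete_only <- combined[, colSums(is.na(combined)) == 0, drop = FALSE]
names(complete_only)
```

This also drops any column that has genuinely missing values in the data, so removing date and time by name is safer when the real measurements may contain NA.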

Removing columns from multiple files in R
