How to identify the data source when reading all files in a directory?

Asked: 2017-08-04 16:35:30

Tags: r dataframe

In a directory, I have a set of files:

cpu_server01.csv
cpu_server02.csv
cpu_server03.csv

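I can read the contents of the files and append them to dflist as shown below. But I need to create another column in each data frame in dflist and put the file name there.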

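path <- "C:/Server/web/"
# cpu

# the pattern is a regular expression: names starting with "cpu_" and ending in ".csv"
filenames <- list.files(path, pattern = "^cpu_.*\\.csv$", full.names = TRUE)

dflist <- lapply(filenames, function(i) {
  read.csv(i, header = TRUE)
})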

How can I add the file name to each data frame, so that the columns look like this?

Date Cpu filename

2 Answers:

Answer 0: (score: 2)

This should work:

# seq_along() is safe even if dflist is empty
for (i in seq_along(dflist))
  dflist[[i]]$file_name <- filenames[i]

Example:

filenames <- c("a", "b")
dflist <- list(head(mtcars, 3), head(mtcars, 3))

for (i in seq_along(dflist))
  dflist[[i]]$file_name <- filenames[i]

Output:

[[1]]
               mpg cyl disp  hp drat    wt  qsec vs am gear carb file_name
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         a
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         a
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         a

[[2]]
               mpg cyl disp  hp drat    wt  qsec vs am gear carb file_name
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         b
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4         b
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         b
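
As a variation on the loop above, the column could also be added while each file is read. The following is only a minimal sketch: the temporary directory, the dummy Date/Cpu values, and the helper name read_with_source() are assumptions made for illustration and do not come from the question.

```r
# Create two small dummy files in a temporary directory (illustrative data only)
tmp <- file.path(tempdir(), "cpu_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:2) {
  write.csv(data.frame(Date = Sys.Date(), Cpu = 10 * i),
            file.path(tmp, sprintf("cpu_server%02d.csv", i)),
            row.names = FALSE)
}

filenames <- list.files(tmp, pattern = "^cpu_.*\\.csv$", full.names = TRUE)

# Hypothetical helper: read one file and tag every row with its file name
read_with_source <- function(f) {
  df <- read.csv(f, header = TRUE)
  # rep() keeps this working for files that contain a header but no rows
  df$file_name <- rep(basename(f), nrow(df))
  df
}

dflist <- lapply(filenames, read_with_source)
```

Using basename() stores only the file name; drop it if the full path should be kept instead.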

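Answer 1: (score: 0)

In addition to Florian's answer, there are two alternative ways to handle this common situation.

Naming the list elements

Copying the file name into a column of each individual data.frame only makes sense, IMHO, if you plan to rbind() the files into one large data object (see the example below).

If you want to keep each data.frame separately in the list, you can name the list elements appropriately, e.g.,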

path <- "."
# get vector of filenames, note that pattern includes the csv extension
filenames <- list.files(path, pattern = "cpu_.*csv$", full.names = TRUE)
# read files as a list of data.frames
dflist <- lapply(filenames, read.csv, header = TRUE)
# rename list element using file names without path
names(dflist) <- basename(filenames)

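Note that there is no need to define an anonymous function when calling lapply(), because lapply() passes any additional arguments on to the called function. So, we can write concisely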

lapply(filenames, read.csv, header = TRUE)

instead of

lapply(filenames, function(i) read.csv(i, header = TRUE)) 
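
The same forwarding can be checked on a toy call that needs no files. This is just an illustrative sketch with made-up inputs; paste() stands in for read.csv().

```r
# Extra arguments given after the function are passed on to every call
short <- lapply(c("a", "b"), paste, "x", sep = "_")

# The equivalent call with an explicit anonymous function
long <- lapply(c("a", "b"), function(s) paste(s, "x", sep = "_"))

# Both forms produce the same list
identical(short, long)
```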

Now, dflist is properly named:

$cpu_server01.csv
  V1   V2
1  A 1001
2  B 1002
3  C 1003

$cpu_server02.csv
  V1   V2
1  A 2001
2  B 2002
3  C 2003

$cpu_server03.csv
  V1   V2
1  A 3001
2  B 3002
3  C 3003
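
One benefit of named elements is that a single file's data can be looked up by name. A small sketch, using a hand-built stand-in for dflist (an assumption, so that the snippet runs without the CSV files):

```r
# Stand-in for the named list; values mirror the dummy data of this answer
dflist <- list(
  cpu_server01.csv = data.frame(V1 = c("A", "B", "C"), V2 = 1001:1003),
  cpu_server02.csv = data.frame(V1 = c("A", "B", "C"), V2 = 2001:2003)
)

# Pick the data of one file by its name
one_file <- dflist[["cpu_server02.csv"]]

# Per-file summaries keep the file names as labels
rows_per_file <- vapply(dflist, nrow, integer(1))
```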

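Identifying the source file in a combined data object

If the goal is to combine all the data chunks into one large data object, the original source file of each row needs to be identified.

This can be achieved with Florian's approach followed by rbinding. Alternatively, we can use the rbindlist() function from the data.table package.

If the list elements have been named as described above, we just need to add: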

combi <- data.table::rbindlist(dflist, idcol = "file.name")
combi
          file.name V1   V2
1: cpu_server01.csv  A 1001
2: cpu_server01.csv  B 1002
3: cpu_server01.csv  C 1003
4: cpu_server02.csv  A 2001
5: cpu_server02.csv  B 2002
6: cpu_server02.csv  C 2003
7: cpu_server03.csv  A 3001
8: cpu_server03.csv  B 3002
9: cpu_server03.csv  C 3003

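rbindlist() has created the id column "file.name" and populated it with the names of the list elements.

Alternatively, we can call rbindlist() first and add the file names as a factor afterwards: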

library(data.table)
path <- "."
# get vector of filenames, note that pattern includes the csv extension
filenames <- list.files(path, pattern = "cpu_.*csv$", full.names = TRUE)
# read files as a list of data.frames and combine immediately
combi <- rbindlist(lapply(filenames, read.csv, header = TRUE), idcol = "file.name")
# change file number to appropriately labeled factor
combi[, file.name := factor(file.name, labels = basename(filenames))][]
          file.name V1   V2
1: cpu_server01.csv  A 1001
2: cpu_server01.csv  B 1002
3: cpu_server01.csv  C 1003
4: cpu_server02.csv  A 2001
5: cpu_server02.csv  B 2002
6: cpu_server02.csv  C 2003
7: cpu_server03.csv  A 3001
8: cpu_server03.csv  B 3002
9: cpu_server03.csv  C 3003
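
For comparison, the base R route mentioned at the start of this section (a file-name column followed by rbind()) might look roughly like the sketch below. The stand-in list and the names tagged and combi_base are assumptions for illustration; data.table is not required here.

```r
# Stand-in for the named list of data.frames
dflist <- list(
  cpu_server01.csv = data.frame(V1 = c("A", "B"), V2 = 1001:1002),
  cpu_server02.csv = data.frame(V1 = c("A", "B"), V2 = 2001:2002)
)

# Prepend the list element name as a column to each data.frame
tagged <- Map(
  function(df, nm) data.frame(file.name = nm, df, stringsAsFactors = FALSE),
  dflist, names(dflist)
)

# Stack all pieces and drop the row names that rbind() derives from the list names
combi_base <- do.call(rbind, tagged)
rownames(combi_base) <- NULL
```

For many or large files rbindlist() is likely the faster choice; the base R version mainly avoids the extra dependency.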

Data

For reproducibility, the dummy files were created by:

idx_vec <- 1:3
invisible(sapply(1:3, function(i) {
  x <- data.frame(V1 = LETTERS[idx_vec], V2 = 1000L * i + idx_vec)
  write.csv(x, sprintf("cpu_server%02i.csv", i), row.names = FALSE)
}))