Import multiple txt files into one data frame and use part of the file name as the "id"

Time: 2017-11-29 15:34:12

Tags: r dplyr

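I have a directory of text files named using the following convention: "Location[A-Z]_House[0-15]_Day[0-15].txt", so an example is LA_H05_D14.txt. Is there a way to split the names so that the parts can be used as a factor? More specifically, I want to use the letter [A-Z] that follows the "L" for location. For example, LB_H01_D01.txt is location "B", so can all data belonging to location B be labelled "B"?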

I have imported all the data from the files into one data frame:

l = list.files(patt="txt$", full.names = T)
library(dplyr)

Df = bind_rows(lapply(l,  function(i) {temp <- read.table(i,stringsAsFactors = FALSE,sep=";"); 
setNames(temp, c("Date","Time","Timestamp","PM2_5(ug/m3)","AQI(US)","AQI(CN)","PM10(ug/m3)","Outdoor AQI(US)","Outdoor AQI(CN)","Temperature(C)","Temperature(F)","Humidity(%RH)","CO2(ppm)","VOC(ppb)"
))}), .id = "id")

The data looks like this, with an "id" column:

head(Df)
  id       Date     Time  Timestamp PM2_5(ug/m3) AQI(US) AQI(CN) PM10(ug/m3) Outdoor AQI(US) Outdoor AQI(CN) Temperature(C) Temperature(F)
1  1 2017/10/17 20:31:38 1508272298        102.5     175     135         512               0               0             30           86.1
2  1 2017/10/17 20:31:48 1508272308         93.6     171     124         477               0               0             30           86.1
3  1 2017/10/17 20:31:58 1508272318         98.0     173     129         397               0               0             30           86.0
4  1 2017/10/17 20:32:08 1508272328         98.0     173     129         422               0               0             30           86.0
5  1 2017/10/17 20:32:18 1508272338        104.3     176     137         466               0               0             30           86.0
6  1 2017/10/17 20:32:28 1508272348        101.6     175     134         528               0               0             30           86.0
  Humidity(%RH) CO2(ppm) VOC(ppb)
1            43      466       -1
2            43      467       -1
3            42      468       -1
4            42      469       -1
5            42      471       -1
6            42      471       -1

2 Answers:

Answer 0 (score: 2)

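Independently of the question about the content of the id column, you can use the following code to extract the information from the file names: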

# you may use the original file names ...
filenames <- basename(l)
# ... or the content of the id column, if you stored the file names in Df
filenames <- as.character(Df$id)
# for demonstration, here is a definition of example file names
filenames <- c("LA_H01_D01.txt",
               "LA_H02_D02.txt",
               "LD_H01_D14.txt",
               "LD_H01_D15.txt")

# turn "LA_H01_D01.txt" into "A_01_01"
filenames <- gsub("_H|_D", "_", filenames)
filenames <- gsub("\\.txt$|^L", "", filenames)  # escape the dot so it only matches ".txt"

fileinfo <- as.data.frame(do.call(rbind, strsplit(filenames, "_")),
                          stringsAsFactors = FALSE)
colnames(fileinfo) <- c("Location", "House", "Day")

fileinfo[c("House", "Day")] <- lapply(fileinfo[c("House", "Day")], as.numeric)
#   Location House Day
# 1        A     1   1
# 2        A     2   2
# 3        D     1  14
# 4        D     1  15

# add the information to your Df as new columns
# (this requires one file name per row of Df, e.g. filenames taken from Df$id)
Df <- cbind(Df, fileinfo)

# the whole thing as a function used in your data import
add_fileinfo <- function(df, filename) {

  # drop the directory part: list.files(full.names = TRUE) returns paths
  filename <- basename(filename)
  filename <- gsub("_H|_D", "_", filename)
  filename <- gsub("\\.txt$|^L", "", filename)

  fileinfo <- as.data.frame(do.call(rbind, strsplit(filename, "_")),
                            stringsAsFactors = FALSE)
  colnames(fileinfo) <- c("Location", "House", "Day")

  fileinfo[c("House", "Day")] <- lapply(fileinfo[c("House", "Day")], as.numeric)

  # repeat the single row of file info once per row of df
  cbind(df, fileinfo[rep(seq_len(nrow(fileinfo)), each = nrow(df)), ])

}

Df <- bind_rows(lapply(l, function(i) {
  temp <- read.table(i, stringsAsFactors = FALSE, sep = ";")
  # assign the result of setNames(), otherwise the column names are lost
  temp <- setNames(temp, c("Date","Time","Timestamp","PM2_5(ug/m3)","AQI(US)","AQI(CN)","PM10(ug/m3)","Outdoor AQI(US)","Outdoor AQI(CN)","Temperature(C)","Temperature(F)","Humidity(%RH)","CO2(ppm)","VOC(ppb)"))
  add_fileinfo(temp, i)
}), .id = "id")
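
As an alternative to the two gsub() calls, the three parts of the name can be captured with a single regular expression. The following is only a minimal sketch, assuming R >= 3.4.0 (for utils::strcapture()) and file names that follow the convention exactly. The paths in l and the id vector are made-up stand-ins for list.files() and Df$id. It also shows how a numeric id, like the one in the head(Df) output of the question, can be mapped back to the file it came from.

```r
# Hypothetical file paths, standing in for the result of list.files()
l <- c("./LA_H01_D01.txt", "./LB_H05_D14.txt", "./LD_H01_D15.txt")

# One row of file info per file; proto fixes the column names and types
fileinfo <- utils::strcapture(
  pattern = "^L([A-Z])_H([0-9]+)_D([0-9]+)\\.txt$",
  x       = basename(l),
  proto   = data.frame(Location = character(), House = integer(),
                       Day = integer(), stringsAsFactors = FALSE)
)
fileinfo$Location <- factor(fileinfo$Location)

# Hypothetical stand-in for Df$id: the position of each row's file in l
id <- c("1", "1", "2", "3", "3")

# Look up the file info of every row by its position in l
per_row <- fileinfo[as.integer(id), ]
rownames(per_row) <- NULL
print(per_row)
```

With the real data, cbind(Df, fileinfo[as.integer(Df$id), ]) would attach the columns in the same way, provided that Df was built from the same l in the same order.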

Answer 1 (score: 1)

A (generic) solution like this should get you going.

mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)

Then, merge:

myfulldata = merge(mydata1, mydata2)

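As long as mydata1 and mydata2 have at least one common column with the same name (which allows observations in mydata1 to be matched to observations in mydata2), this works like a charm. It also takes only three lines.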
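
To make the role of the common column concrete, here is a small sketch with two made-up data frames (the column names and values are invented for illustration). By default merge() joins on the columns both data frames share and keeps only the rows whose key appears in both.

```r
# Two hypothetical data frames that share the column "id"
mydata1 <- data.frame(id = c(1, 2, 3), temperature = c(30.0, 29.5, 31.2))
mydata2 <- data.frame(id = c(2, 3, 4), humidity = c(43, 42, 40))

# merge() joins on the common column "id"; only ids 2 and 3 occur in both
myfulldata <- merge(mydata1, mydata2)
print(myfulldata)

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA
myalldata <- merge(mydata1, mydata2, all = TRUE)
print(myalldata)
```

Note that merge() adds columns side by side. Stacking the rows of files that have identical columns, as in the question, is what bind_rows() or rbind() does.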

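What if I have 20 files containing data that I want to match observation to observation? Assuming they all have a common column that allows merging, I would still have to read in 20 files (20 lines of code), and merge() works two by two... so I could merge the 20 data frames with 19 merge statements like this: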

mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
.
.
.
mytempdata = merge(mytempdata, mydata20)

This is tedious. You may be looking for a simpler way. If you are, I wrote a function to solve your woes called multmerge(). Here is the code to define the function:

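multmerge = function(mypath){
  # full paths of all files in the directory
  filenames = list.files(path=mypath, full.names=TRUE)
  # read every file into a list of data frames
  datalist = lapply(filenames, function(x){read.csv(file=x, header=TRUE)})
  # merge the data frames pairwise, from left to right
  Reduce(function(x,y) {merge(x,y)}, datalist)
}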
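
The Reduce() step can be tried without any files. This is a minimal sketch on three made-up data frames that share an "id" column (all names and values here are invented for illustration); Reduce() merges the first two and then merges that result with the third.

```r
# Hypothetical list of data frames, standing in for the files read by lapply()
datalist <- list(
  data.frame(id = 1:3, a = c(10, 20, 30)),
  data.frame(id = 1:3, b = c("x", "y", "z")),
  data.frame(id = 1:3, c = c(TRUE, FALSE, TRUE))
)

# Equivalent to merge(merge(datalist[[1]], datalist[[2]]), datalist[[3]])
merged <- Reduce(function(x, y) merge(x, y), datalist)
print(merged)
```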

Here is a good resource that may help you.

https://stats.idre.ucla.edu/r/codefragments/read_multiple/