提取日期和值以创建数据框

时间:2016-06-11 23:32:29

标签: r dataframe data-extraction

以下是我文件夹中的三个示例数据文件(文本文件)。我尝试提取与每个数据文件对应的日期,并为所有三个文件创建数据框。 以下是文件:

File1中:

### DEFAULTS ###
### OPTIONS ###
Title 2 left      $ Date and time: 2010-02-03 12:00 UTC
###  DATA  ### 
250 23.54300 0.00000 12.90000
500 17.47400 3.50000 21.70000
750 41.33200 22.10000 30.40000

文件2:

### DEFAULTS ###
### OPTIONS ###
Title 2 left      $ Date and time: 2010-02-10 12:00 UTC 
###  DATA  ###
250 36.95300 30.60000 27.10000
500 40.87700 37.80000 27.80000
750 41.46100 38.30000 30.70000

文件3:

### DEFAULTS ###
### OPTIONS ###
Title 2 left      $ Date and time: 2010-02-17 12:00 UTC 
###  DATA  ###
250 28.91200 24.90000 5.00000
500 49.82900 32.40000 11.10000
750 50.83600 40.60000 22.20000

以下是我为提取日期而编写的代码。

### Start R Code for extracting date
setwd("C:/Users/")
path = "~C:/Users"
file.names<- dir("C:/Users/", pattern =".dat")
file.names

fileNameVector <- NULL
dateNameVector <- NULL

for(i in 1:length(file.names)){

  x<-readLines(file.names[i])
  x
  date.list<-x[3]
  date.list

  xx<-strsplit(date.list,' ')
  xx

  date.fin<-unlist(xx)[13]

  fileNameVector <- rbind(fileNameVector, file.names[i])
  dateNameVector <- rbind(dateNameVector, date.fin)

}

finalDateName <- cbind(fileNameVector, dateNameVector)
finalDateName

并提取值:

setwd("C:/Users")
path = "~C:/Users"
st021ozone<-""
file.names<- dir("C:/Users", pattern =".dat")
for(i in 1:length(file.names)){
  file <- read.table(file.names[i],skip=4,header=FALSE, sep=" ", stringsAsFactors=FALSE)
  st021ozone <- rbind(st021ozone, file)
}
write.table(st021ozone, file = "st021ozone",sep=",", 
            row.names = FALSE, qmethod = "double",fileEncoding="windows-1252")

我想加入这两个代码并创建一个完整的数据框,其中日期位于表的左侧,对应于每个文件中的250值。最终的数据框将有9行5列。

这是期望的结果:

2010-02-03 250 23.54300 0.00000 12.90000
           500 17.47400 3.50000 21.70000
           750 41.33200 22.10000 30.40000
2010-02-10 250 36.95300 30.60000 27.10000
           500 40.87700 37.80000 27.80000
           750 41.46100 38.30000 30.70000
2010-02-17 250 28.91200 24.90000 5.00000
           500 49.82900 32.40000 11.10000
           750 50.83600 40.60000 22.20000

提前感谢您的帮助。

编辑:这是该文件的一个示例。我有53个类似的文件:

$
### DEFAULTS ###
Color scale       $ 
Contour levels    $ 
Contour colors    $ 
Contour type      $ 
Map limits        $ 
Map projection    $ 
### OPTIONS ###
Image order       $ 1
Title 1 left      $ Case 0236-004 - Vertical 
Vertical 
Title 2 left      $ Date and time: 2010-09-08 
Receptor code: Town021
Title 3 left      $ 
Title 4 left      $ 
Title 1 right     $ 
Title 2 right     $ tranport
Title 3 right     $ Start: 2010-01-01 00:00 UTC
Title 4 right     $   
gar-0236-004-01-20151103085553.png
###  DATA  ###
1 1 0 1 1.83400 32.00000 0.00000 21.20000
1 1 100 1 3.27300 0.00000 25.10000 21.90000
1 1 250 1 12.22200 0.00000 34.30000 25.60000
1 1 500 1 27.27400 0.00000 35.00000 31.30000
1 1 750 1 26.45300 0.00000 36.10000 35.90000
1 1 1000 1 32.62200 0.00000 36.40000 39.30000
1 1 1500 1 35.22700 0.00000 36.70000 42.20000
1 1 2000 1 37.90300 0.00000 37.40000 43.80000
1 1 3000 1 47.28200 0.00000 44.30000 46.90000
1 1 4000 1 51.01200 0.00000 49.00000 49.90000
1 1 5000 1 61.06500 0.00000 49.40000 51.00000

@alistaire感谢您的链接。这是我试图执行的代码。

setwd("C:/Users/")
path = "~C:/Users/"
test.sample<-""
files <- lapply(list.files(pattern = '\\.dat'), readLines) 
test.sample<- rbind(test.sample, files)
do.call(rbind, lapply(files, function(lines){
  # for each file, return a data.frame of the datetime, pulled with regex
   data.frame(datetime = as.POSIXct(sub('^.*Date and time: ', '', lines[grep('Date and time:', lines)])),
             # and the data, read in as text
         read.table(text = lines[(grep('DATA', lines) + 1):length(lines)]))
  }))
write.table(test.sample, file = "test.sample", sep="\t", qmethod ="double",row.names = FALSE,fileEncoding="windows-1252")

我分配了一个文件名并给出了一个路径。当我将变量名'test.sample'写入控制台时,我得到了一个矩阵:see image for the result 我想我错过了什么? NB。这里我使用的是一个包含44个文件的文件夹。

1 个答案:

答案 0 :(得分:1)

这些案件主要是乱砍乱七八糟的事情,尽管有些模式会让生活更轻松。由于这些文件至少是常规文件,因此这里不需要太多东西来获得有用的东西:

# read files into a list
files <- lapply(paste0('File', 1:3, '.dat'), readLines)

# loop across files (into list) with `lapply`, recombine list with `do.call(rbind, ...`
do.call(rbind, lapply(files, function(lines){
    # for each file, return a data.frame of the datetime, pulled with regex
    data.frame(datetime = as.POSIXct(sub('^.*Date and time: ', '', lines[3])),
               # and the data, read in as text
               read.table(text = lines[5:length(lines)]))
}))

#              datetime  V1     V2   V3   V4
# 1 2010-02-03 12:00:00 250 23.543  0.0 12.9
# 2 2010-02-03 12:00:00 500 17.474  3.5 21.7
# 3 2010-02-03 12:00:00 750 41.332 22.1 30.4
# 4 2010-02-10 12:00:00 250 36.953 30.6 27.1
# 5 2010-02-10 12:00:00 500 40.877 37.8 27.8
# 6 2010-02-10 12:00:00 750 41.461 38.3 30.7
# 7 2010-02-17 12:00:00 250 28.912 24.9  5.0
# 8 2010-02-17 12:00:00 500 49.829 32.4 11.1
# 9 2010-02-17 12:00:00 750 50.836 40.6 22.2

根据需要进行编辑。

修改

对于新文件结构,现在更加独立于行号:

do.call(rbind, lapply(files, function(lines){
    # for each file, return a data.frame of the datetime, pulled with regex
    data.frame(datetime = as.POSIXct(sub('^.*Date and time: ', '', lines[grep('Date and time:', lines)])),
               # and the data, read in as text
               read.table(text = lines[(grep('DATA', lines) + 1):length(lines)]))
}))

#      datetime V1 V2   V3 V4     V5 V6   V7   V8
# 1  2010-09-08  1  1    0  1  1.834 32  0.0 21.2
# 2  2010-09-08  1  1  100  1  3.273  0 25.1 21.9
# 3  2010-09-08  1  1  250  1 12.222  0 34.3 25.6
# 4  2010-09-08  1  1  500  1 27.274  0 35.0 31.3
# 5  2010-09-08  1  1  750  1 26.453  0 36.1 35.9
# 6  2010-09-08  1  1 1000  1 32.622  0 36.4 39.3
# 7  2010-09-08  1  1 1500  1 35.227  0 36.7 42.2
# 8  2010-09-08  1  1 2000  1 37.903  0 37.4 43.8
# 9  2010-09-08  1  1 3000  1 47.282  0 44.3 46.9
# 10 2010-09-08  1  1 4000  1 51.012  0 49.0 49.9
# 11 2010-09-08  1  1 5000  1 61.065  0 49.4 51.0