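I am trying to "import" data from a text file that is not shaped like a data.frame and that contains multiple precipitation reports. The reports all share the same layout; a sample follows: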
I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES
INFORMATION SYSTEM
PRECIPITATION TOTAL VALUES (mms) NATIONAL ENVIRONMENTAL
DATE OF PROCESS : 2015/09/15 YEAR 1980 STATION ID : 11010010 VUELTA LA
LAT 0527 N TIPO EST PM STATE CHOCO INSTALLATION DATE 1943-ENE
LON 7632 W ENT 01 IDEAM CITY LLORO FECHA-SUSPENSION
ELE 100 m.s.n.m REGIONAL 01 ANTIOQUIA CORRIENTE ANDAGUEDA
DAY JAN * FEB * MAR * APR * MAY * JUN * JUL * AGO * SEP * OCT * NOV * DEC *
01 30.0 .0 .0 3.0 80.0 .0 3.0 .0 35.0 88.0 1.0
02 .0 1.0 .0 1.0 100.0 .0 .0 6.0 1.0 65.0 69.0
03 35.0 100.0 .0 10.0 .0 .0 .0 70.0 40.0 42.0 16.0
04 .0 .0 80.0 3.0 140.0 8.0 .0 135.0 20.0 48.0 15.0
05 .0 .0 .0 8.0 3.0 20.0 4.0 19.0 80.0 .0 20.0
06 .0 .0 100.0 138.0 .0 6.0 .0 4.0 20.0 .0 10.0
07 31.0 10.0 .0 30.0 15.0 50.0 6.0 .0 4.0 .0 .0
08 .0 44.0 .0 10.0 40.0 .0 .0 .0 7.0 .0 4.0
09 35.0 3.0 23.0 .0 20.0 140.0 .0 6.0 .0 32.0 16.0
10 .0 75.0 .0 .0 60.0 .0 .0 23.0 3.0 1.0 5.0
11 .0 17.0 .0 15.0 80.0 .0 .0 80.0 .0 .0 3.0
12 .0 75.0 .0 8.0 .0 63.0 10.0 .0 .0 17.0 10.0
13 .0 20.0 .0 60.0 .0 .0 .0 110.0 50.0 3.0 25.0
14 55.0 .0 26.0 12.0 .0 3.0 140.0 4.0 74.0 .0 38.0
15 .0 .0 3.0 7.0 10.0 .0 6.0 .0 35.0 12.0 27.0
16 .0 4.0 89.0 20.0 3.0 .0 .0 10.0 .0 .0 .0
17 45.0 .0 9.0 .0 30.0 .0 2.0 .0 60.0 103.0 .0
18 30.0 .0 .0 .0 21.0 .0 20.0 15.0 .0 .0 .0
19 .0 130.0 .0 10.0 12.0 8.0 .0 3.0 20.0 49.0 40.0
20 45.0 .0 25.0 190.0 .0 38.0 8.0 .0 8.0 3.0 1.0
21 1.0 .0 45.0 50.0 .0 35.0 .0 2.0 13.0 1.0 4.0
22 .0 .0 20.0 .0 .0 .0 .0 16.0 10.0 12.0 50.0
23 40.0 .0 40.0 16.0 .0 30.0 .0 13.0 2.0 106.0 10.0
24 .0 .0 45.0 60.0 .0 3.0 .0 25.0 .0 16.0 .0
25 .0 .0 .0 .0 18.0 10.0 .0 3.0 .0 50.0 20.0
26 10.0 .0 .0 .0 9.0 6.0 20.0 20.0 6.0 15.0 3.0
27 .0 135.0 60.0 40.0 80.0 15.0 .0 18.0 10.0 77.0 .0
28 10.0 .0 9.0 15.0 .0 .0 .0 6.0 72.0 102.0 .0
29 23.0 6.0 .0 .0 .0 .0 .0 23.0 .0 34.0 .0
30 .0 10.0 .0 20.0 3.0 .0 64.0 14.0 111.0 .0
31 .0 31.0 10.0 .0 .0 .0
*** ANNUAL VALUES ***
TOTAL 6954.0
No DE RAIN DAYS 210
MAX 24 Hrs 190.0
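The text file contains one report after another, and every report starts with the same header, "I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES". I have read the text file with the readLines() function, and I would like to create a data frame that holds the information from every report, like this: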
DATE STATION_ID LAT LON ELE CITY STATE PRECIPITATION
01/JAN/1980 11010010 0527 N 7632 W 100 LLORO CHOCO 0
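I have been trying to split out each report and then parse it line by line. Unfortunately, that is a slow process. I understand this site expects narrowly scoped questions, but I am a bit stuck.
Thanks in advance.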
Answer 0 (score: 1)
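Here is one approach.
Use readLines() to read in a whole page, 56 lines.
Use substr() to pull the header fields (year, station, latitude and so on) out of their fixed positions on the page.
Generate a column with every date of the year and cbind it with the header information.
For each date, the rainfall value sits on line 14 + dayOfMonth, and the horizontal offset can come from a vector of 12 numbers, one per month. Add that column to your page.
If you rbind each page as you go through them, you end up with one long (!) tidy data set.
[edit] If your data set is large, you will also spend some time on memory management when growing a data frame that way. Instead, you can build a list of data frames and bind them all at the end. For details, see this question and this question.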
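The lookup step may be easier to follow on a toy example first. The sketch below is not taken from the real file: the three-day, three-month page, its 6-character field width, and the names `vals`, `page`, `right_edge` and `grid` are all made up for illustration. It only shows that substr() is vectorised over the strings and over the start/stop positions, so one call can read a value for every (day, month) pair.

```r
# Hypothetical values: rows are days 1..3, columns are months 1..3
vals <- matrix(c(30, 0,   1.5,
                  0, 1,   100,
                 35, 100, 0), nrow = 3, byrow = TRUE)

# Build a synthetic fixed-width page: 2-character day number,
# then one right-aligned 6-character field per month
page <- sprintf("%02d%6.1f%6.1f%6.1f", 1:3, vals[, 1], vals[, 2], vals[, 3])

# Column of the last character of each month's field (assumed layout)
right_edge <- c(8, 14, 20)

# One row per (day, month) combination; day varies fastest
grid <- expand.grid(day = 1:3, month = 1:3)

# Pick the line by day and the columns by month, all in one substr() call
edge <- right_edge[grid$month]
grid$value <- as.numeric(substr(page[grid$day], edge - 5, edge))

print(grid)
```

On the real pages the same idea applies with the line offset and the twelve column offsets measured from the file itself.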
Here is some code I came up with; you may want to test it on a short extract first.
library("lubridate")

raw2page <- function(rawdata) {
  # Takes a character vector holding one page of data and returns a tidy data frame

  # Template for the page header: c(line, start column, end column)
  yearbound      <- c(5, 60, 63)
  stationbound   <- c(5, 105, 112)
  latbound       <- c(7, 16, 19)
  longbound      <- c(8, 16, 19)
  deptobound     <- c(7, 81, 101)
  municipiobound <- c(8, 81, 101)
  framebounds <- rbind(yearbound, stationbound, latbound, longbound, deptobound, municipiobound)
  colnames(framebounds) <- c("line", "start", "end")
  framebounds <- as.data.frame(framebounds)

  # Extract all header fields with one vectorised substr() call (one-row data frame)
  framedata <- as.data.frame(rbind(with(framebounds, substr(rawdata[line], start, end))),
                             stringsAsFactors = FALSE)
  colnames(framedata) <- c("year", "station", "latitude", "longitude", "depto", "municipio")
  trim <- function(x) gsub("^\\s+|\\s+$", "", x)
  framedata$depto <- trim(framedata$depto)
  framedata$municipio <- trim(framedata$municipio)

  # Make a column listing all dates of the year
  st <- as.Date(paste(framedata$year, "-01-01", sep = ""))
  en <- as.Date(paste(framedata$year, "-12-31", sep = ""))
  date <- seq(st, en, by = 1)
  pagedata <- cbind(framedata, date)

  # Horizontal offsets for the last digit of each month (the last digit is aligned)
  mboundaries <- c(25, 34, 43, 52, 61, 70, 79, 88, 97, 106, 115, 124)

  # Use the generated dates as coordinates to read each rainfall amount into a vector
  right <- mboundaries[month(pagedata$date)]
  rainfall <- as.numeric(substr(rawdata[14 + mday(pagedata$date)], right - 6, right))

  # Bind the vector to the page data to make a tidy data set
  page <- cbind(pagedata, rainfall)
  page
}

raw <- readLines("area1.txt")  # read in all the data

# Get the line number of every page header
headers <- grep("HIDROLOGIA", raw)
listOfDataFrames <- vector(mode = "list", length = length(headers))

# Page by page, store each tidy data frame in the list
for (i in seq_along(headers)) {
  start <- headers[i]
  end <- start + 55  # a page is 56 lines long
  listOfDataFrames[[i]] <- raw2page(raw[start:end])
}

# Bind all pages at once instead of growing a data frame inside the loop
library("plyr")
output <- rbind.fill(listOfDataFrames)
print(summary(output))
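
If you want the result in the layout shown in the question, something like the following sketch might do as a last step. It assumes `output` has the columns produced above; the small `output` data frame built here is only a hypothetical stand-in so the snippet runs on its own, and `report` is a made-up name. ELE is left out because the code above does not extract it. month.abb is used instead of the "%b" format so that the month names do not depend on the locale.

```r
# Hypothetical stand-in for a few rows of `output`
output <- data.frame(
  year = "1980", station = "11010010",
  latitude = "0527", longitude = "7632",
  depto = "CHOCO", municipio = "LLORO",
  date = as.Date(c("1980-01-01", "1980-01-02", "1980-02-01")),
  rainfall = c(30, 0, 0),
  stringsAsFactors = FALSE
)

# Build DATE as dd/MON/yyyy using the built-in English month abbreviations
month_index <- as.integer(format(output$date, "%m"))
report <- data.frame(
  DATE          = paste(format(output$date, "%d"),
                        toupper(month.abb[month_index]),
                        format(output$date, "%Y"), sep = "/"),
  STATION_ID    = output$station,
  LAT           = output$latitude,
  LON           = output$longitude,
  CITY          = output$municipio,
  STATE         = output$depto,
  PRECIPITATION = output$rainfall,
  stringsAsFactors = FALSE
)

print(report)
```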