Reading an irregular text data file into R

Date: 2015-10-01 21:16:28

Tags: r text dataframe text-mining plaintext

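I am trying to import data from a text file that is not laid out like a data.frame and that contains multiple precipitation reports. All reports share the same layout; a sample follows: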

  I D E A M  -  INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES
                                                                                                          INFORMATION SYSTEM
                                  PRECIPITATION TOTAL VALUES (mms)                              NATIONAL ENVIRONMENTAL 

DATE OF PROCESS :  2015/09/15                    YEAR  1980                              STATION ID : 11010010  VUELTA LA

LAT    0527 N               TIPO EST    PM                   STATE      CHOCO                   INSTALLATION DATE   1943-ENE
LON   7632 W               ENT     01  IDEAM            CITY  LLORO                   FECHA-SUSPENSION
ELE   100 m.s.n.m         REGIONAL    01  ANTIOQUIA        CORRIENTE  ANDAGUEDA


      DAY       JAN *  FEB *  MAR *  APR *  MAY  *  JUN *  JUL *  AGO *  SEP *  OCT *  NOV *  DEC *


       01                 30.0       .0       .0      3.0     80.0       .0      3.0       .0     35.0     88.0      1.0
       02                   .0      1.0       .0      1.0    100.0       .0       .0      6.0      1.0     65.0     69.0
       03                 35.0    100.0       .0     10.0       .0       .0       .0     70.0     40.0     42.0     16.0
       04                   .0       .0     80.0      3.0    140.0      8.0       .0    135.0     20.0     48.0     15.0
       05                   .0       .0       .0      8.0      3.0     20.0      4.0     19.0     80.0       .0     20.0
       06                   .0       .0    100.0    138.0       .0      6.0       .0      4.0     20.0       .0     10.0
       07                 31.0     10.0       .0     30.0     15.0     50.0      6.0       .0      4.0       .0       .0
       08                   .0     44.0       .0     10.0     40.0       .0       .0       .0      7.0       .0      4.0
       09                 35.0      3.0     23.0       .0     20.0    140.0       .0      6.0       .0     32.0     16.0
       10                   .0     75.0       .0       .0     60.0       .0       .0     23.0      3.0      1.0      5.0
       11                   .0     17.0       .0     15.0     80.0       .0       .0     80.0       .0       .0      3.0
       12                   .0     75.0       .0      8.0       .0     63.0     10.0       .0       .0     17.0     10.0
       13                   .0     20.0       .0     60.0       .0       .0       .0    110.0     50.0      3.0     25.0
       14                 55.0       .0     26.0     12.0       .0      3.0    140.0      4.0     74.0       .0     38.0
       15                   .0       .0      3.0      7.0     10.0       .0      6.0       .0     35.0     12.0     27.0
       16                   .0      4.0     89.0     20.0      3.0       .0       .0     10.0       .0       .0       .0
       17                 45.0       .0      9.0       .0     30.0       .0      2.0       .0     60.0    103.0       .0
       18                 30.0       .0       .0       .0     21.0       .0     20.0     15.0       .0       .0       .0
       19                   .0    130.0       .0     10.0     12.0      8.0       .0      3.0     20.0     49.0     40.0
       20                 45.0       .0     25.0    190.0       .0     38.0      8.0       .0      8.0      3.0      1.0
       21                  1.0       .0     45.0     50.0       .0     35.0       .0      2.0     13.0      1.0      4.0
       22                   .0       .0     20.0       .0       .0       .0       .0     16.0     10.0     12.0     50.0
       23                 40.0       .0     40.0     16.0       .0     30.0       .0     13.0      2.0    106.0     10.0
       24                   .0       .0     45.0     60.0       .0      3.0       .0     25.0       .0     16.0       .0
       25                   .0       .0       .0       .0     18.0     10.0       .0      3.0       .0     50.0     20.0
       26                 10.0       .0       .0       .0      9.0      6.0     20.0     20.0      6.0     15.0      3.0
       27                   .0    135.0     60.0     40.0     80.0     15.0       .0     18.0     10.0     77.0       .0
       28                 10.0       .0      9.0     15.0       .0       .0       .0      6.0     72.0    102.0       .0
       29                 23.0      6.0       .0       .0       .0       .0       .0     23.0       .0     34.0       .0
       30                            .0     10.0       .0     20.0      3.0       .0     64.0     14.0    111.0       .0
       31                            .0              31.0              10.0       .0                .0                .0


                                  ***  ANNUAL VALUES  ***

                                 TOTAL                  6954.0
                                 No DE RAIN DAYS         210
                                 MAX 24 Hrs        190.0

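The text file contains the reports one after another, and every report starts with the same header, "I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES". I have read the text file with the readLines() function, and I want to build a data frame that holds the information from every report, like this: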

DATE        STATION_ID  LAT    LON    ELE CITY STATE PRECIPITATION
01/JAN/1980 11010010    0527 N 7632 W 100 LLORO CHOCO 0

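I have been trying to split out each report and then parse it line by line. Unfortunately, that is a slow process. I understand this site expects narrowly scoped questions, but I am somewhat stuck.

Thanks in advance.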

1 Answer:

Answer 0: (score: 1)

Here is one approach.

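  1. Use readLines() to read in a whole page, 56 lines.
  2. Pull the header information out by knowing the line number and character position of the latitude, longitude, elevation, city, state and year within the page. Use substr().
  3. Using the year obtained there, write out all the dates of that year and cbind them with the header information.
  4. Use a function that takes the month and the day of the month and finds the corresponding precipitation amount on the page. The line number is 14 + dayOfMonth, and the horizontal offsets can be a vector of 12 numbers, one per month. Add that column to your page.
  5. If you rbind each page as you go, you end up with one long (!) tidy data set. [edit] If your data set is large, you will also spend some time managing memory that way. Instead, you can build a list of data frames and bind them all at the end. See this question for details.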
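Step 4 is easier to follow on a single line of text. The sketch below builds one synthetic day row (the values are made up, and `month_end`, `field_width`, `values` and `parsed` are illustrative names) and then cuts the 12 monthly fields back out. It assumes each field is 7 characters wide and right-aligned, with right edges 25, 34, ..., 124, which are the offsets used in this answer's code; adjust them if your file is laid out differently.

```r
# Right edge of each month's field, same numbers as the mboundaries vector in the answer's code
month_end <- seq(25, by = 9, length.out = 12)
field_width <- 7

# Made-up values for one day row, January to December; NA stands for a blank field
values <- c(NA, 30, 0, 0, 3, 80, 0, 3, 0, 35, 88, 1)

# Build a synthetic fixed-width line with each value right-aligned in its field
line <- strrep(" ", max(month_end))
for (m in which(!is.na(values))) {
  substr(line, month_end[m] - field_width + 1, month_end[m]) <-
    formatC(values[m], width = field_width, format = "f", digits = 1)
}

# Cut the 12 fields back out by position; a blank field converts to NA
parsed <- as.numeric(substring(line, month_end - field_width + 1, month_end))
print(parsed)
```

Reading by position rather than splitting on whitespace is what keeps the months aligned when a field is blank, as happens for days that do not exist in a month or for missing observations.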

    Here is some code I came up with; you can test it on a short extract first.

    library("lubridate")  # mday(), month()
    library("plyr")       # rbind.fill()

    raw2page <- function(rawdata) {
      # Takes a character vector holding one page of data, returns a tidy data frame.
      # Template for the page header: each entry is
      # c(line within the page, first column, last column).
      yearbound      <- c(5, 60, 63)
      stationbound   <- c(5, 105, 112)
      latbound       <- c(7, 16, 19)
      longbound      <- c(8, 16, 19)
      deptobound     <- c(7, 81, 101)
      municipiobound <- c(8, 81, 101)

      framebounds <- rbind(yearbound, stationbound, latbound, longbound, deptobound, municipiobound)
      colnames(framebounds) <- c("line", "start", "end")
      framebounds <- as.data.frame(framebounds)

      # Cut each header field out of its line; the result is a one-row data frame
      framedata <- as.data.frame(rbind(with(framebounds, substr(rawdata[line], start, end))),
                                 stringsAsFactors = FALSE)
      colnames(framedata) <- c("year", "station", "latitude", "longitude", "depto", "municipio")
      framedata$depto <- trimws(framedata$depto)
      framedata$municipio <- trimws(framedata$municipio)

      # Make a column listing all dates of the year
      st <- as.Date(paste0(framedata$year, "-01-01"))
      en <- as.Date(paste0(framedata$year, "-12-31"))
      date <- seq(st, en, by = 1)
      pagedata <- cbind(framedata, date)  # the one header row is recycled for every date

      # Horizontal offsets for the last digit of each month (the last digit is aligned)
      mboundaries <- c(25, 34, 43, 52, 61, 70, 79, 88, 97, 106, 115, 124)
      # Use the generated dates and these coordinates to read the rainfall amounts into a vector;
      # blank fields become NA
      last <- mboundaries[month(pagedata$date)]
      rainfall <- as.numeric(substr(rawdata[14 + mday(pagedata$date)], last - 6, last))
      # Bind the vector to the page data to make a tidy data set
      page <- cbind(pagedata, rainfall)
      page
    }

    raw <- readLines("area1.txt")  # read in all the data

    # Line numbers at which each page header starts
    headers <- grep("HIDROLOGIA", raw)
    page_length <- 56  # lines per page, as in step 1

    listOfDataFrames <- vector(mode = "list", length = length(headers))

    # Page by page, store each tidy data frame in the list
    for (i in seq_along(headers)) {
      start <- headers[i]
      end <- min(start + page_length - 1, length(raw))
      listOfDataFrames[[i]] <- raw2page(raw[start:end])
    }

    # Bind all pages in one step
    output <- rbind.fill(listOfDataFrames)
    print(summary(output))
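
The question asks for one row per day with the columns DATE, STATION_ID, LAT, LON, ELE, CITY, STATE and PRECIPITATION. A minimal sketch of that last renaming step follows, assuming `output` has the columns produced by `raw2page()`. The two-row `output` here is made-up stand-in data so the snippet runs on its own, `to_requested_layout` is a hypothetical helper name, and ELE is left out because the code above does not extract the elevation.

```r
# Stand-in for the `output` data frame built above (values are illustrative only)
output <- data.frame(
  year = "1980", station = "11010010",
  latitude = "0527", longitude = "7632",
  depto = "CHOCO", municipio = "LLORO",
  date = as.Date(c("1980-01-01", "1980-02-01")),
  rainfall = c(NA, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename and reorder the columns to match the requested layout
to_requested_layout <- function(df) {
  d <- as.POSIXlt(df$date)
  data.frame(
    # month.abb is always English, so the month label does not depend on the locale
    DATE = sprintf("%02d/%s/%d", d$mday, toupper(month.abb[d$mon + 1]), d$year + 1900L),
    STATION_ID = df$station,
    LAT = df$latitude,
    LON = df$longitude,
    CITY = df$municipio,
    STATE = df$depto,
    PRECIPITATION = df$rainfall,
    stringsAsFactors = FALSE
  )
}

result <- to_requested_layout(output)
print(result)
```

To include ELE, or the hemisphere letters shown in the requested LAT and LON, one more entry per field could be added to the header template in `raw2page()`; the exact character positions depend on the original file.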