How to read data in R with two headers and string values separated by spaces

Time: 2019-07-23 12:23:05

Tags: r file import data-manipulation

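I have run into a problem: I need to read some data files with an odd structure, and I don't know how to read them. The data has two headers, and the first header starts at the fourth column.
The values in every column are numeric, except for four columns that contain strings separated by spaces (I can't modify the data because I only have read access to it). I need to read the values. I don't mind if the header names are dropped, or if the strings are replaced by a value according to the message type, of which there are four. Being able to read the values of selected columns would be enough, even if the columns have no names.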

Here is an example of the kind of file I want to read. They are .dat files:

                                  B1         B1              B1               B1                   B1                 B1                     B2         B2              B2                       B2                   B2                   B2         
  Year  Month  Day  Hour  Min   Number1   Number2         Number3           Message             Number4            Message2                Number1   Number2          Number3                 Message              Number4              Message2  
  2019    4     9    8    53     3.29      46.31           0.03      There are no problems         1        There are no problems           3.00       2.00            0.00                                           1          There are no problems       
  2019    4     9    8    54     3.19      46.17           0.03      There are no problems         1        There are two problems          3.00       2.00            0.00             There are no problems         1          There are no problems  
  2019    4     9    8    55     3.15      46.17           0.03      There are no problems         1                                        3.00       3.92            0.00             There are no problems         1          There are three problems  

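I found a solution here for reading data files whose strings contain spaces: How to read a character-string in a column of a data-set, but I don't know how to handle the two headers, the first of which starts at the fourth column...
Any help would be greatly appreciated.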

1 Answer:

Answer 0: (score: 0)

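You're on the right track. read_lines() has a parameter called skip, which lets you skip the first line. You will get a warning because the column names are not unique, but it seems you don't care much about that ;-)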
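If the duplicated names do turn out to matter, one possible way to make them unique is to prefix each name with its group from the first header line. This is only a sketch: it assumes the group header (B1, B2) lines up with the rightmost columns, as it appears to in the sample, and `split_ws`, `group_line` and `name_line` are illustrative names, with the header lines shortened here.

```r
# The two header lines from the question, shortened for illustration
group_line <- "                      B1       B1       B2       B2"
name_line  <- "  Year  Month  Number1  Message  Number1  Message"

# Hypothetical helper: trim a line and split it on runs of whitespace
split_ws <- function(x) unlist(strsplit(trimws(x), "\\s+"))

groups <- split_ws(group_line)
cnames <- split_ws(name_line)

# The group header has no entries for the leading date columns, so
# left-pad it with empty strings until both vectors have equal length
# (assumption: groups belong to the rightmost columns)
groups <- c(rep("", length(cnames) - length(groups)), groups)

# Prefix each name with its group where one exists, e.g. "B1_Number1"
unique_names <- ifelse(groups == "", cnames, paste(groups, cnames, sep = "_"))
unique_names
```

A vector like `unique_names` could then be supplied as the `col_names` argument of `fwf_positions()` instead of the raw header names.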

So, building on what you have already found (https://stackoverflow.com/a/56238232/1842673):

library(readr)
library(dplyr)
fname <- 'sample.txt'
write_file("                                B1         B1              B1               B1                   B1                 B1                     B2         B2              B2                       B2                   B2                   B2         
  Year  Month  Day  Hour  Min   Number1   Number2         Number3           Message             Number4            Message2                Number1   Number2          Number3                 Message              Number4              Message2  
  2019    4     9    8    53     3.29      46.31           0.03      There are no problems         1        There are no problems           3.00       2.00            0.00                                           1          There are no problems       
  2019    4     9    8    54     3.19      46.17           0.03      There are no problems         1        There are two problems          3.00       2.00            0.00             There are no problems         1          There are no problems  
  2019    4     9    8    55     3.15      46.17           0.03      There are no problems         1                                        3.00       3.92            0.00             There are no problems         1          There are three problems  "
 ,
  fname
)

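# Read only the second line of the file, which holds the column names
hdr <- read_lines(fname, n_max = 1, skip = 1)
cnames <- hdr %>%
  trimws() %>%
  strsplit('\\s+') %>%
  unlist()

# Each column ends where a header name ends (a non-space followed by a space or end of line)
m <- gregexpr('\\S(?=\\s|$)', hdr, perl = TRUE)
epos <- unlist(m)
# Each column starts one character after the previous column ends
spos <- lag(epos + 1, 1, default = 1)

# skip = 2 so that neither header line is parsed as a data row;
# note that values extending past the end of their header name will be cut off
read_fwf(fname, fwf_positions(start = spos, end = epos, col_names = cnames), skip = 2)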
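
If a value is wider than its header name (as "There are no problems" is wider than "Message"), positions taken from the header alone can cut the value short. The following is an alternative sketch, not the method above: it derives column boundaries from the character positions that are blank in every line, and treats only blank runs of at least two characters as separators. It assumes adjacent columns are always separated by two or more spaces and that text values contain only single spaces. The data is a synthetic stand-in for the .dat file, and `min_gap` is a hypothetical tuning parameter.

```r
# Synthetic stand-in for the .dat file (hypothetical values); the second
# data row has an empty Message field, as in the question's sample
fmt <- "%-4s  %7s  %-21s  %7s"
lines <- c(
  sprintf(fmt, "Year", "Number1", "Message", "Number4"),
  sprintf(fmt, "2019", "3.29", "There are no problems", "1"),
  sprintf(fmt, "2019", "3.19", "", "1")
)

# Pad every line to the same width and split into a character matrix
width  <- max(nchar(lines))
padded <- paste0(lines, strrep(" ", width - nchar(lines)))
chars  <- do.call(rbind, strsplit(padded, ""))

# A position is blank when no line has a non-space character there
blank <- colSums(chars != " ") == 0

# Only blank runs at least `min_gap` wide count as separators, so the
# single spaces inside "There are no problems" stay within one field
min_gap <- 2
runs    <- rle(blank)
is_sep  <- rep(runs$values & runs$lengths >= min_gap, runs$lengths)

# The remaining runs are the fields; record where each starts and ends
fields <- rle(is_sep)
ends   <- cumsum(fields$lengths)[!fields$values]
starts <- ends - fields$lengths[!fields$values] + 1

# Cut the data rows (everything after the header line) at those positions
values <- sapply(seq_along(starts), function(i) {
  trimws(substring(padded[-1], starts[i], ends[i]))
})
colnames(values) <- trimws(substring(padded[1], starts, ends))
values
```

The resulting `starts` and `ends` could then be passed to `fwf_positions()` in place of `spos` and `epos`, with both header lines skipped.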