Question

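I've run into a problem: I need to read some oddly structured data files and I don't know how. The data has two header rows, and the first one starts at the fourth column.
The values in every column are numeric, except for four columns whose strings contain spaces (I can't modify the data, because I only have read access to it). I need to read the values. I don't mind if the header names are dropped, or if each string is replaced by a value according to the message type, of which there are four. Being able to read the values of selected columns would be enough, even if the columns have no names.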

Here is an example of the kind of file I need to read; they are .dat files:

                                  B1         B1              B1               B1                   B1                 B1                     B2         B2              B2                       B2                   B2                   B2         
  Year  Month  Day  Hour  Min   Number1   Number2         Number3           Message             Number4            Message2                Number1   Number2          Number3                 Message              Number4              Message2  
  2019    4     9    8    53     3.29      46.31           0.03      There are no problems         1        There are no problems           3.00       2.00            0.00                                           1          There are no problems       
  2019    4     9    8    54     3.19      46.17           0.03      There are no problems         1        There are two problems          3.00       2.00            0.00             There are no problems         1          There are no problems  
  2019    4     9    8    55     3.15      46.17           0.03      There are no problems         1                                        3.00       3.92            0.00             There are no problems         1          There are three problems

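I found a solution for reading data files whose strings contain spaces here: How to read a character-string in a column of a data-set. But with two header rows, one of which only starts at the fourth column, I don't know how to proceed...
Any help would be much appreciated.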

Answer 1

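You're on the right track. read_lines() has a parameter called skip, which lets you skip the first line. You will get a warning because the column names are not unique, but it seems you don't care much about that ;-)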
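As a small illustration of how skip and n_max combine, here is a sketch on a made-up three-line file. The file contents and the names tmp, N1 are assumptions for the example only, not taken from the real data.

```r
library(readr)

# Hypothetical miniature file: group header, column header, one data row.
tmp <- tempfile(fileext = ".dat")
writeLines(c("          B1    B2",
             "  Year    N1    N1",
             "  2019  3.29  3.00"), tmp)

# skip = 1 jumps over the first line; n_max = 1 then reads only the next one,
# so hdr holds the column-header line with its original spacing intact.
hdr <- read_lines(tmp, skip = 1, n_max = 1)
print(hdr)
```

Keeping the spacing intact matters here, because the column positions are later derived from where the header names sit on the line.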

So, building on what you already found (https://stackoverflow.com/a/56238232/1842673):

library(readr)
library(dplyr)
fname <- 'sample.txt'
write_file("                                B1         B1              B1               B1                   B1                 B1                     B2         B2              B2                       B2                   B2                   B2         
  Year  Month  Day  Hour  Min   Number1   Number2         Number3           Message             Number4            Message2                Number1   Number2          Number3                 Message              Number4              Message2  
  2019    4     9    8    53     3.29      46.31           0.03      There are no problems         1        There are no problems           3.00       2.00            0.00                                           1          There are no problems       
  2019    4     9    8    54     3.19      46.17           0.03      There are no problems         1        There are two problems          3.00       2.00            0.00             There are no problems         1          There are no problems  
  2019    4     9    8    55     3.15      46.17           0.03      There are no problems         1                                        3.00       3.92            0.00             There are no problems         1          There are three problems  "
 ,
  fname
)

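# Read only the second header line: skip = 1 jumps over the B1/B2 group header
hdr <- read_lines(fname, n_max = 1, skip = 1)
# Split the header on runs of whitespace to get the column names
cnames <- hdr %>%
  trimws() %>%
  strsplit('\\s+') %>%
  unlist()

# Each column ends at the last non-space character of its header name
m <- gregexpr('\\S(?=\\s|$)', hdr, perl = TRUE)
epos <- unlist(m)
# Each column starts one character after the previous column ends
spos <- lag(epos + 1, 1, default = 1)

# skip = 2 so that neither header line is parsed as a data row
read_fwf(fname, fwf_positions(start = spos, end = epos, col_names = cnames), skip = 2)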
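If the duplicate column names turn out to be a nuisance, one possible variation is to cut the first header line at the same positions and use its labels as prefixes. The following is a base-R sketch on a made-up miniature of the file. The lines vector and the names Num1, Num2 are assumptions for illustration, and the approach assumes every data field lies inside the boundaries derived from the header line. That should be checked against the real file, because the message text is wider than its header name.

```r
# Hypothetical miniature of the file: group header, column header, one data row.
lines <- c(
  "          B1     B1       B2     B2",
  "  Year   Num1   Num2     Num1   Num2",
  "  2019   3.29  46.31     3.00   2.00"
)
group_hdr <- lines[1]
col_hdr   <- lines[2]

# End of each column = last non-space character of each header name.
epos <- unlist(gregexpr("\\S(?=\\s|$)", col_hdr, perl = TRUE))
# Each column starts right after the previous one ends.
spos <- c(1, head(epos, -1) + 1)

# Cut both header lines at the same positions and trim the padding.
names2 <- trimws(substring(col_hdr, spos, epos))
names1 <- trimws(substring(group_hdr, spos, epos))

# Prefix with the group label where there is one, e.g. "B1_Num1".
cnames <- ifelse(names1 == "", names2, paste(names1, names2, sep = "_"))

# Cut the data row the same way and attach the unique names.
values <- trimws(substring(lines[3], spos, epos))
row <- setNames(as.numeric(values), cnames)
print(row)
```

The resulting cnames vector could be passed as col_names to fwf_positions() in place of the names taken from the second header line alone, which would also make it easier to pick out selected columns by name.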
