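I need to read some data files with an unusual structure, and I don't know how to go about it. Each file has two header rows, and the first header row only starts at the fourth column.
The values in every column are numeric, except for four columns that hold strings whose words are separated by spaces (I cannot modify the data, because I only have read access to it). I need to read the values. I don't mind if the header names are dropped, or if each string is replaced by a code for its message type (there are four possible types). Being able to read the values of selected columns would be enough, even if the columns end up without names.
Here is an example of the kind of file I need to read; they are .dat files: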
B1 B1 B1 B1 B1 B1 B2 B2 B2 B2 B2 B2
Year Month Day Hour Min Number1 Number2 Number3 Message Number4 Message2 Number1 Number2 Number3 Message Number4 Message2
2019 4 9 8 53 3.29 46.31 0.03 There are no problems 1 There are no problems 3.00 2.00 0.00 1 There are no problems
2019 4 9 8 54 3.19 46.17 0.03 There are no problems 1 There are two problems 3.00 2.00 0.00 There are no problems 1 There are no problems
2019 4 9 8 55 3.15 46.17 0.03 There are no problems 1 3.00 3.92 0.00 There are no problems 1 There are three problems
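I found a solution here for reading data files whose string columns contain spaces: How to read a character-string in a column of a data-set. But with two header rows, the first of which starts at the fourth column, I don't know how to apply it...
Any help would be much appreciated.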
Answer 0 (score: 0)
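You are on the right track. read_lines() has a parameter named skip, which lets you skip the first line. You will get a warning because the column names are not unique, but it seems you don't care much about that ;-)
So, building on what you already found (https://stackoverflow.com/a/56238232/1842673):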
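As a minimal sketch of the two points above, the snippet below reads only the second line of a small file with `skip` and `n_max`, then shows one way to make repeated names unique with base R's `make.unique()`. The three-line file and the name vector are hypothetical, made up for illustration; they are not the asker's data.

```r
library(readr)

# Hypothetical three-line file: a group row, a header row, one data row.
tmp <- tempfile(fileext = ".dat")
writeLines(c("        B1 B1", "Year Day N1 N2", "2019   9  3  4"), tmp)

# skip = 1 drops the first line; n_max = 1 keeps only the line after it.
hdr <- read_lines(tmp, skip = 1, n_max = 1)

# Split the header on runs of whitespace to get the raw column names.
cnames <- unlist(strsplit(trimws(hdr), "\\s+"))

# make.unique() appends .1, .2, ... to names that repeat.
dup_names <- c("Number1", "Message", "Number1", "Message")
unique_names <- make.unique(dup_names)
```

Passing names through `make.unique()` before handing them to the reader is optional; it only matters if you later want to select a column by name.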
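library(readr)
library(dplyr)

# Write the sample to disk so the example is reproducible.
# Note: the approach relies on the columns being aligned at fixed widths, as in the real file.
fname <- 'sample.txt'
write_file(" B1 B1 B1 B1 B1 B1 B2 B2 B2 B2 B2 B2
Year Month Day Hour Min Number1 Number2 Number3 Message Number4 Message2 Number1 Number2 Number3 Message Number4 Message2
2019 4 9 8 53 3.29 46.31 0.03 There are no problems 1 There are no problems 3.00 2.00 0.00 1 There are no problems
2019 4 9 8 54 3.19 46.17 0.03 There are no problems 1 There are two problems 3.00 2.00 0.00 There are no problems 1 There are no problems
2019 4 9 8 55 3.15 46.17 0.03 There are no problems 1 3.00 3.92 0.00 There are no problems 1 There are three problems "
,
fname
)

# Read only the second line (the column names); skip = 1 skips over the first line.
hdr <- read_lines(fname, n_max = 1, skip = 1)

# Split the header line on whitespace to get the column names.
cnames <- hdr %>%
  trimws() %>%
  strsplit('\\s+') %>%
  unlist()

# Find the end position of each column: a non-space character followed by whitespace or end of line.
m <- gregexpr('\\S(?=\\s|$)', hdr, perl = TRUE)
epos <- unlist(m)

# Each column starts one character after the previous column ends; the first starts at 1.
spos <- lag(epos + 1, 1, default = 1)

# Read the data as fixed width, skipping both header lines.
read_fwf(fname, fwf_positions(start = spos, end = epos, col_names = cnames), skip = 2)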
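To see what the position logic does, here is a small base R sketch on a hypothetical right-aligned header and data line (both invented for illustration). It uses the same regular expression as above, and replaces `dplyr::lag()` with an equivalent base R expression so that it runs without extra packages.

```r
# Hypothetical header line whose names are right-aligned over their columns.
hdr <- "Year Month   Day"

# A non-space character followed by whitespace or end of line marks a column end.
m <- gregexpr("\\S(?=\\s|$)", hdr, perl = TRUE)
epos <- unlist(m)

# Each column starts one character after the previous column ends.
spos <- c(1, head(epos, -1) + 1)

# Cut the same character ranges out of a data line and trim the padding.
row <- "2019     4     9"
fields <- trimws(substring(row, spos, epos))
```

Because the column boundaries come from where each header name ends, this only works when every value ends at or before the last character of its header name, which is an assumption about the real file's layout.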