I am trying to open and clean a large set of oceanographic data in R, in which station information is interspersed as headers between chunks of observation data:
$
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999.0 -9 -9 -9 -9 4868.8 2017 0 7114
2.0 6.0297 35.0199 34.4101 2.0 11111
3.0 6.0279 35.0201 34.4091 3.0 11111
4.0 6.0272 35.0203 34.4091 4.0 11111
5.0 6.0273 35.0204 34.4097 4.9 11111
6.0 6.0274 35.0205 34.4104 5.9 11111
$
2008 1 777 8 17 12 7 25 78.4738 8.3510 27 6 4.1 -999.0 3 7 2 0 4903.8 1570 0 7114
3.0 6.4129 34.5637 34.3541 3.0 11111
4.0 6.4349 34.5748 34.3844 4.0 11111
5.0 6.4803 34.5932 34.4426 4.9 11111
6.0 6.4139 34.5624 34.3552 5.9 11111
7.0 6.5079 34.6097 34.4834 6.9 11111
Each $ is followed by a line containing the station data (e.g. year, ..., lat, lon, date, time), which is in turn followed by several lines containing the observations sampled at that station (e.g. depth, temperature, salinity, etc.).
I would like to add the station data to the observations, so that each variable is a column and each observation is a row, like this:
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 2 6.0297 35.0199 34.4101 2 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 3 6.0279 35.0201 34.4091 3 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 4 6.0272 35.0203 34.4091 4 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 5 6.0273 35.0204 34.4097 4.9 11111
2008 1 774 8 17 5 11 2 78.4952 6.0375 30 7 1.2 -999 6 6.0274 35.0205 34.4104 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 3 6.4129 34.5637 34.3541 3 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 4 6.4349 34.5748 34.3844 4 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 5 6.4803 34.5932 34.4426 4.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 6 6.4139 34.5624 34.3552 5.9 11111
2008 1 777 8 17 12 7 25 78.4738 8.351 27 6 4.1 -999 7 6.5079 34.6097 34.4834 6.9 11111
Answer 0 (score: 2)
This solution is fairly involved and relies on knowledge of several tidyverse libraries and functions. I'm not sure it will meet all of your needs, but it does work for the sample you posted. The general approach - folding the blocks, writing functions to parse the smaller chunks, and then unfolding the results - may be useful in any case.
The first part finds the "$" markers, groups the lines that follow each marker together, and "nests" those data blocks. That leaves a data frame with only a few rows - one per section.
library(tidyverse)
txt_lns <- readLines("ocean-sample.txt")
txt <- tibble(txt = txt_lns)
# Start by finding new sections, and nesting the data
nested_txt <- txt %>%
  mutate(row_number = row_number()) %>%
  mutate(new_section = str_detect(txt, "\\$")) %>%            # Mark new sections
  mutate(starting = ifelse(new_section, row_number, NA)) %>%  # Index with row num
  tidyr::fill(starting) %>%                                   # Fill index down where missing
  select(-new_section) %>%                                    # Clean up
  filter(!str_detect(txt, "\\$")) %>%
  nest(data = c(txt, row_number))                             # "Nest" the data
# Take a quick look
nested_txt
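With the two sections from your sample, the nested result should look roughly like this (structure only; shown purely for illustration):
# # A tibble: 2 x 2
#   starting data
#      <int> <list>
# 1        1 <tibble [6 x 2]>
# 2        8 <tibble [6 x 2]>
Each nested tibble holds one header line plus its observation lines, together with their original row numbers.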
We then need to be able to process those nested blocks. The routine below parses each block by identifying the header row and then separating the fields into their own columns; the header row goes through different logic than the shorter observation rows.
# Deal with the records within a section
parse_inner_block <- function(x, header_ind) {
  if (header_ind) {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the header row into 22 variables
      separate(txt, into = LETTERS[1:22], sep = "\\s+")
  } else {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the lesser rows into 6 variables
      separate(txt, into = letters[1:6], sep = "\\s+")
  }
  return(df)
}
parse_outer_block <- function(x) {
  df <- x %>%
    # Determine if it's a header row with 22 variables or a lesser row with 6
    mutate(leading_row = (row_number == min(row_number))) %>%
    # Fold by header row vs. not
    nest(data = c(txt, row_number)) %>%
    # Create data frames for both header and lesser rows
    mutate(processed = purrr::map2(data, leading_row, parse_inner_block)) %>%
    unnest(processed) %>%
    # Copy header row values to lesser rows
    tidyr::fill(A:V) %>%
    # Drop header row
    filter(!leading_row)
  return(df)
}
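If you want to check the parsing before running it over the whole dataset, you can apply the function to a single nested block first (an optional sanity check, assuming the nested_txt object created above):
# Parse just the first station's block and peek at the result
nested_txt$data[[1]] %>%
  parse_outer_block() %>%
  head()
This should return the first few observation rows of that station, with the header fields A through V copied onto every row.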
We can then put it all together: start with the nested data, process each block, unnest the returned fields, and tidy up the full output.
# Actually put all this together and generate an output dataframe
output <- nested_txt %>%
  mutate(proc_out = purrr::map(data, parse_outer_block)) %>%
  select(-data) %>%
  unnest(proc_out) %>%
  select(-starting, -leading_row, -data, -row_number)
output
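One thing to keep in mind: separate() leaves all of the new columns as character. If you need numeric values, a minimal follow-up step is to run the result through readr's type converter (a sketch only; the name output_clean is just a placeholder):
# Re-parse the character columns produced by separate() into numeric types
output_clean <- readr::type_convert(output)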
Hope that helps. I'd also recommend working through some purrr tutorials, since they cover a number of similar problems.
Answer 1 (score: 1)
This one is simpler and depends only on base R. I assume you have already read the text file with x <- readLines(....):
start <- which(x == "$") + 1 # Find header indices
rows <- diff(c(start, length(x)+2)) - 2 # Find number of lines per group
# Function to read header and rows and cbind
getdata <- function(begin, end) {
  cbind(read.table(text=x[begin]), read.table(text=x[(begin+1):(begin+end)]))
}
dta.list <- lapply(1:(length(start)), function(i) getdata(start[i], rows[i]))
dta.df <- do.call(rbind, dta.list)
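To make the indexing concrete, here is what those two vectors work out to for the two-section sample in the question (14 lines, with "$" at lines 1 and 8):
# start = which(x == "$") + 1       -> c(2, 9)  (line numbers of the header rows)
# rows  = diff(c(2, 9, 14 + 2)) - 2 -> c(5, 5)  (observation lines in each group)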
This works for the two groups included in your post. You will need to fix the column names afterwards, since V1-V6 are repeated at the beginning and at the end.
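One quick way to deal with that is to assign your own names right after building dta.df. I don't know the real variable names in your file, so the ones below are only generic placeholders (22 station fields followed by 6 observation fields):
# Replace these placeholders with the real variable names from your data documentation
names(dta.df) <- c(paste0("station_", 1:22), paste0("obs_", 1:6))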