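I am trying to split a data frame column into multiple columns based on a delimiter. My data frame has a single column that looks like this: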
A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG
I want a data frame with 6 columns, namely Site, File, Variable, Timestamp, Value and Comment, like this:
Site File Variable Timestamp Value Comment
A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG
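I tried the tidyr package and its separate function, since the fields of each observation are separated by spaces. The problem is that the comments themselves contain spaces, and I don't want the comments to be split. Is there a way to do this? Any help would be much appreciated. Thanks!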
Answer 0 (score: 3)
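Another tidyverse answer, this time using tidyr::separate.
Note that every column is space-delimited except the last one, which can itself contain spaces. In that case we can split on spaces only as many times as needed to get the number of columns we know we want.
tidyr::separate takes an extra argument that handles this use case: extra = "merge".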
library(tidyverse)
data.raw = "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG"
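# A string containing newlines is read as literal data; with no commas in it, each line lands in the single column Col1
data = read_csv(data.raw, col_names = "Col1")
data %>%
  # split on the first five whitespace characters only; everything after them stays in Comment
  separate(Col1, into = c("Site", "File", "Variable", "Timestamp", "Value", "Comment"), sep = "\\s", extra = "merge") %>%
  type_convert() %>% # re-guess column types, so Variable and Value become numeric
  head()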
#> # A tibble: 3 x 6
#> Site File Variable Timestamp Value Comment
#> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
#> 2 A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
#> 3 A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG
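To make the effect of extra = "merge" easier to see in isolation, here is a minimal sketch on a made-up one-row data frame (toy and its column names are hypothetical, not part of the question's data). It compares "merge" with "drop", another value the extra argument accepts.

```r
library(tidyr)

# Hypothetical toy input: five space-separated tokens in one column
toy <- data.frame(Col1 = "a b c d e", stringsAsFactors = FALSE)

# extra = "merge": split only as often as needed for three columns;
# the unsplit remainder is kept in the last column
merged <- separate(toy, Col1, into = c("first", "second", "rest"),
                   sep = " ", extra = "merge")

# extra = "drop": split everywhere and silently discard the surplus pieces
dropped <- separate(toy, Col1, into = c("first", "second", "rest"),
                    sep = " ", extra = "drop")

merged$rest   # "c d e"
dropped$rest  # "c"
```

With the question's data the same thing happens at the sixth column, which is why the comment text survives intact.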
Answer 1 (score: 3)
This looks like a ragged fixed-width format file, so
library(readr)
library(dplyr) # for glimpse()
# For the rows as shown here, the fields occupy characters 1-8, 10-11, 13-17, 19-34 and 36-39; the comment runs from 41 to the end of the line
pos <- fwf_positions(start = c(1, 10, 13, 19, 36, 41), end = c(8, 11, 17, 34, 39, NA))
df <- read_fwf(file = "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG", col_positions = pos )
glimpse(df)
# Observations: 3
# Variables: 6
# $ X1 <chr> "A0017493", "A0017493", "A0017496"
# $ X2 <chr> ".A", ".A", ".A"
# $ X3 <dbl> 11.86, 11.86, 11.82
# $ X4 <chr> "23:59_10/10/2016", "23:59_10/11/2016", "23:59_11/12/2016"
# $ X5 <dbl> 1.00, 1.15, 2.06
# $ X6 <chr> "SURVEYED", "DATALOGGER CHANGED", "READING IS WRONG"
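Counting column offsets by hand is easy to get wrong, so here is a rough sketch of deriving them from the data instead. It assumes every row has the same field widths as the first row and that the first five fields contain no spaces; the variable names (line, starts, ends) are only illustrative.

```r
# First row of the question's data, used as the template for the offsets
line <- "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED"

# Character positions of every space in the line
spaces <- as.integer(gregexpr(" ", line, fixed = TRUE)[[1]])

# Only the first five spaces separate fields; later ones belong to the comment
first5 <- spaces[1:5]

starts <- c(1, first5 + 1)   # each field starts right after a separator
ends   <- c(first5 - 1, NA)  # NA means "read to the end of the line"

# Sanity check: cut the first field back out of the template line
substring(line, starts[1], ends[1])  # "A0017493"
```

The two vectors can then be passed on as fwf_positions(start = starts, end = ends). If the real file pads fields with a varying number of spaces, this shortcut does not apply and the offsets have to come from the file's specification.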
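Answer 2 (score: 0)
We can use the tidyverse packages to do what you want. The key is to split each row on the ' ' character and then glue the pieces of the comment back together. This assumes your original data is in a data frame named df that has a single column named V1.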
library(tidyverse)
# rbind.fill comes from plyr, not the tidyverse; call it as plyr::rbind.fill rather than attaching plyr,
# because attaching it after dplyr would mask dplyr's mutate() and rename()
df.new <- strsplit(df$V1, split = ' ') %>% # split each row into a character vector contained in a list
  lapply(function(x) data.frame(rbind(x))) %>% # turn each vector into a one-row data frame (columns X1, X2, ...)
  plyr::rbind.fill() %>% # glue together the ragged rows, padding short ones with NA
  unite('Comment', -X1:-X5, sep = ' ') %>% # recombine every column that is NOT one of the first 5 (i.e., combine comment columns)
  mutate(Comment = gsub(' NA', '', Comment)) %>% # get rid of the ' NA' strings left by the padding
  rename(Site = X1, File = X2, Variable = X3, Timestamp = X4, Value = X5) %>% # relabel columns
  mutate_all(as.character) %>% type_convert() # convert columns to appropriate formats
df.new
Site File Variable Timestamp Value Comment
1 A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
2 A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
3 A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG
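For comparison, the same split-then-recombine idea can be sketched in base R alone, which avoids the NA padding step. This is only an illustration: it rebuilds df from the three rows in the question, and n_fixed, parts, rows and result are made-up names.

```r
# The three rows from the question in a one-column data frame named df
df <- data.frame(
  V1 = c("A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED",
         "A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED",
         "A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG"),
  stringsAsFactors = FALSE
)

n_fixed <- 5  # number of leading fields that never contain a space

# Split every row on single spaces
parts <- strsplit(df$V1, " ", fixed = TRUE)

# Keep the first n_fixed tokens as they are and paste the rest back into one comment
rows <- lapply(parts, function(p) {
  c(p[seq_len(n_fixed)], paste(p[-seq_len(n_fixed)], collapse = " "))
})

# Stack the equal-length character vectors into a data frame
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(result) <- c("Site", "File", "Variable", "Timestamp", "Value", "Comment")

# Convert the numeric fields explicitly
result$Variable <- as.numeric(result$Variable)
result$Value <- as.numeric(result$Value)
```

Because each row is padded to exactly six elements before stacking, no rbind.fill and no ' NA' cleanup is needed.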