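Similar to this previous question, I am trying to convert a vector into a data frame in R. I use this trick to convert it to a matrix and then to a data frame, but some rows may have a different number of columns, which throws off my data frame. Each row can have any number of values (i.e. not necessarily the 3 columns in the example), so I first check how many columns I need.
For example, given the sample data below, I get a tidy data frame.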
example <- c(
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = TRUE))
# X1 X2 X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 col-a col-b col-c
This all works fine until I get a vector with an extra value in one row (i.e. one that needs an extra column). For example:
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = TRUE))
# X1 X2 X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 WATCH OUT! col-a col-b
# 4 col-c col-a col-b
However, what I actually want is:
# X1 X2 X3 X4
# 1 col-a col-b col-c NA
# 2 col-a col-b col-c WATCH OUT!
# 3 col-a col-b col-c NA
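After checking whether there is an odd number of elements between the first-column elements, I could handle this with a double loop, but that is surely far from optimal.
An additional complication is that the "extra" column can be anywhere, not necessarily the last column.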
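For reference, the double loop can probably be avoided in base R. The following is only a minimal sketch, assuming every row starts with the same marker as the first element of the vector; the names `row_id`, `chunks`, `width`, `padded` and `result` are made up for illustration.

```r
example <- c("col-a", "col-b", "col-c",
             "col-a", "col-b", "col-c", "WATCH OUT!",
             "col-a", "col-b", "col-c")

# Assumption: a new row begins wherever the first element reappears.
row_id <- cumsum(example == example[1])
chunks <- split(example, row_id)

# `length<-` extends a vector and fills the new slots with NA,
# so every chunk ends up as wide as the longest one.
width  <- max(lengths(chunks))
padded <- lapply(chunks, `length<-`, width)

result <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(result) <- paste0("X", seq_len(width))
result
```

This pads purely by position, so an extra value in the middle of a row shifts the values after it one column to the right. When the values carry a key that identifies their column, matching on that key is likely the more robust choice.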
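Edit: the order of the columns is actually arbitrary, so there is no reason the extra column has to be in the middle; it can be appended at the end. One option I considered is pulling it out and appending it afterwards, padded with NA. Text that belongs in the same column is also delimited, so it is clear where each value belongs. The example below has been updated.
Here is some more realistic sample data and the desired output: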
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
# Desired output
#   X1         X2     X3              X4
# 1 name:start date:a NA              value:b
# 2 name:start date:c desc:WATCH OUT! value:d
# 3 name:start date:e NA              value:f
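What is the fastest way to handle this?
Thanks in advance!
Edit: the "chunks" that become rows are well defined, so it is clear where each chunk starts and ends, and finding the chunk size is not hard, hence my {{ 1}} command (a similar result can also be obtained with diff(grep(...))). Careful! The text can be arbitrary, so it is not as simple as searching for WATCH OUT!.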
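To make the chunk-size idea concrete, here is a small sketch. It assumes every chunk starts with an element beginning with `name:`; `starts`, `chunk_sizes` and `ncols` are illustrative names. Note that a plain `diff(grep(...))` has no entry for the last chunk, and that taking only its first element (as in `ncols[1]` above) misses a longer chunk further down, so this version appends a one-past-the-end position and takes the maximum.

```r
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# Assumption: each chunk begins with an element that starts with "name:".
starts <- grep("^name:", example)

# Append one-past-the-end so the last chunk gets a size as well.
chunk_sizes <- diff(c(starts, length(example) + 1))
chunk_sizes

# The widest chunk determines how many columns the result needs.
ncols <- max(chunk_sizes)
ncols
```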
Answer 0 (score: 1)
Does this help?
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
example_dirty <- example #i will use it at the end of the script for replacing
custom_pattern <- rebus::or('name:.*', 'date:.', 'value:.')
alien_text_index <- str_detect(example, pattern = custom_pattern) %>%
as.character()
replacement <- which(alien_text_index == 'FALSE') %>%
`/`(., 3) %>% #in this case every three rows the repetition should start over.
round() #round for getting an index to modify
example <- str_match(example , pattern = custom_pattern) %>% keep(~!is.na(.))
df <- c('name:.*', 'date:.', 'value:.') %>%
map(~example[str_detect(example, .x)]) %>% reduce(bind_cols) %>%
mutate(..4 = '')
#> New names:
#> * NA -> ...1
#> * NA -> ...2
#> New names:
#> * NA -> ...3
for (i in seq_along(replacement)) { # visit every replacement, not only the last one
df[replacement[i], 4] <- example_dirty[!as.logical(alien_text_index)][i]
}
df
#> # A tibble: 3 x 4
#> ...1 ...2 ...3 ..4
#> <chr> <chr> <chr> <chr>
#> 1 name:start date:a value:b ""
#> 2 name:start date:c value:d "desc:WATCH OUT!"
#> 3 name:start date:e value:f ""
Created on 2021-05-29 by the reprex package (v2.0.0)
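The row lookup above divides the position by 3, which relies on every row having exactly three regular fields. As a hedged alternative sketch, the row of each unexpected element can be derived from a running count of the `name:` markers instead. It assumes the regular fields are exactly `name`, `date` and `value`; `known`, `row_id` and `extra_rows` are names introduced here for illustration.

```r
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# Assumption: only these three prefixes are "regular" fields.
known <- grepl("^(name|date|value):", example)

# Running count of row starts gives the row each element belongs to.
row_id <- cumsum(grepl("^name:", example))

# Rows that contain an unexpected element, and the elements themselves.
extra_rows   <- row_id[!known]
extra_values <- example[!known]
extra_rows
extra_values
```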
Answer 1 (score: 1)
I am not sure whether output in this format is useful:
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate(dummy, into=c("name", 'value'), sep = '\\:') %>%
mutate(rowid = cumsum(name == first(name))) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = value)
#> # A tibble: 3 x 5
#> rowid name date value desc
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 start a b <NA>
#> 2 2 start c d WATCH OUT!
#> 3 3 start e f <NA>
Or this?
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate(dummy, into=c("name", 'value'), sep = '\\:', remove = FALSE) %>%
mutate(rowid = cumsum(name == first(name))) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#> rowid name date value desc
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 name:start date:a value:b <NA>
#> 2 2 name:start date:c value:d desc:WATCH OUT!
#> 3 3 name:start date:e value:f <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
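If tidyverse is not available, the same key-based reshaping can be sketched in base R. This is only an illustration under two assumptions: the column key is the text before the first `:`, and a new row starts whenever the first key reappears. The names `keys`, `row_id`, `columns` and `wide` are hypothetical.

```r
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# Key = everything before the first ":"; the full string stays as the cell value.
keys <- sub(":.*$", "", example)

# Assumption: a new row starts whenever the first key reappears.
row_id  <- cumsum(keys == keys[1])
columns <- unique(keys)

# Start from an all-NA matrix, then fill cells via a two-column
# (row, column) index matrix.
wide <- matrix(NA_character_, nrow = max(row_id), ncol = length(columns),
               dimnames = list(NULL, columns))
wide[cbind(row_id, match(keys, columns))] <- example

wide <- as.data.frame(wide, stringsAsFactors = FALSE)
wide
```

Columns appear in order of first occurrence, so `desc` ends up last here. If a key occurs twice within one row, the later value silently overwrites the earlier one.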
For your first example, you can do this:
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
group_by(rowid = cumsum(dummy == first(dummy))) %>%
mutate(name = paste0('X', row_number())) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#> # Groups: rowid [3]
#> rowid X1 X2 X3 X4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 col-a col-b col-c <NA>
#> 2 2 col-a col-b col-c WATCH OUT!
#> 3 3 col-a col-b col-c <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)