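I have an untidy data set that combines two variables (some of them missing) in each of its two columns (a small subsample is in the data frame 'untidy' below). I am struggling to create the desired tidy data set shown further down.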
untidy <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, 5L), class = "data.frame")
Desired data frame
N_patients N_ears pct_patients pct_ears
173 NA 58.61 NA
60 NA 13.30 NA
54 96 11.11 NA
168 328 52.38 NA
906 1685 14.79 10.45
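Thanks!
There always seems to be an edge case: neither answer handles the problem in row 5. It looks like it is just a regular expression issue. Any suggestions on how to fix it?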
untidy_2 <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]",
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%",
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
That is, in row 5, [35.33%] is parsed as pct_patients when it should be pct_ears:
N [ears] % Otorrhea N_patients N_ears pct_patients pct_ears
1 173 58.61% 173 NA 58.61 NA
2 60 13.30% 60 NA 13.30 NA
3 54 [96] 11.11% 54 96 11.11 NA
4 168 [328] 52.38% 168 328 52.38 NA
5 75 [150] [35.33%] 75 150 35.33 NA
Answer 0 (score: 2)
Happily, this is very easy with the tidyr package from the tidyverse.
library(tidyverse)
test <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", "906 [1685]"),
`% Otorrhea` = c("58.61%", "13.30%", "11.11%", "52.38%", "14.79% [10.45%]")),
.Names = c("N [ears]", "% Otorrhea"),
row.names = c(NA, 5L), class = "data.frame")
test %>%
separate(`N [ears]`, into = c("N_patients", "N_ears"), sep = "\\s\\[", fill = "right") %>%
separate(`% Otorrhea`, into = c("pct_patients", "pct_ears"), sep = "\\s\\[", fill = "right") %>%
# parse_number() drops the leftover "%" and "]" and returns numeric columns
mutate(across(everything(), parse_number))
#> N_patients N_ears pct_patients pct_ears
#> 1 173 NA 58.61 NA
#> 2 60 NA 13.30 NA
#> 3 54 96 11.11 NA
#> 4 168 328 52.38 NA
#> 5 906 1685 14.79 10.45
Answer 1 (score: 1)
Here is an alternative approach that uses the extract() function with regular expressions:
library(tidyr)
test %>%
extract(`N [ears]`, into = c("N_patients", "N_ears"),
regex = "^(\\d+)(?:\\s\\[(\\d+)\\])?$") %>%
extract(`% Otorrhea`, into = c("pct_patients", "pct_ears"),
regex = "^([.0-9]+)%(?:\\s\\[([.0-9]+)%\\])?$")
# N_patients N_ears pct_patients pct_ears
#1 173 <NA> 58.61 <NA>
#2 60 <NA> 13.30 <NA>
#3 54 96 11.11 <NA>
#4 168 328 52.38 <NA>
#5 906 1685 14.79 10.45
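Here, the non-capturing group (?:...) followed by ? makes the bracketed ears value optional, so rows without it still match and get NA in the ears column.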
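The effect of the optional non-capturing group can be checked in isolation. The following is a minimal sketch in base R rather than tidyr; the vector x and the "|" separator in the replacement are illustrative choices, not part of the answer above.

```r
# Same pattern as in the extract() call for the `N [ears]` column.
pattern <- "^(\\d+)(?:\\s\\[(\\d+)\\])?$"

# Illustrative inputs: one value without and one with the bracketed ears count.
x <- c("173", "54 [96]")

# Both forms match because the whole (?:...) group is optional.
matched <- grepl(pattern, x, perl = TRUE)

# Only the two (\\d+) groups are numbered; (?:...) does not get a number,
# so \\1 is the patient count and \\2 is the ears count.
out <- sub(pattern, "\\1|\\2", x, perl = TRUE)
print(matched)
print(out)
```

When the optional group does not take part in the match, the second backreference is left empty, which corresponds to the NA that extract() reports for N_ears.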
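Answer 2 (score: 0)
The best answer for my actual data set was provided in a comment by alistaire (https://stackoverflow.com/users/4497050/alistaire). It is shown below, wrapped in a simple function.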
library(tidyverse)
make_tidy <- function(untidy){
tidy <- untidy %>%
separate_(colnames(untidy)[1], c('N_patients', 'N_ears'), fill = 'right', extra = 'drop', convert = TRUE) %>%
separate_(colnames(untidy)[2], c('pct_patients', 'pct_ears'), sep = '[^\\d.]+', extra = 'drop', convert = TRUE)
}
tidy_2 <- make_tidy(untidy_2)
This parses untidy_2 correctly:
> tidy_2
# A tibble: 5 × 4
N_patients N_ears pct_patients pct_ears
* <int> <int> <dbl> <dbl>
1 173 NA 58.61 NA
2 60 NA 13.30 NA
3 54 96 11.11 NA
4 168 328 52.38 NA
5 906 1685 14.79 10.45
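To see why the separator '[^\\d.]+' copes with a value that consists only of a bracketed percentage, it helps to split a few strings by hand. This is a rough sketch with base R's strsplit(), which only approximates what separate_() does internally (for instance, strsplit() drops a trailing empty piece); the vector pct is a hypothetical sample that includes the [35.33%] edge case from the question.

```r
# Hypothetical sample values, including the bracket-only edge case.
pct <- c("58.61%", "14.79% [10.45%]", "[35.33%]")

# Split on runs of characters that are neither digits nor dots.
# perl = TRUE is needed so that \\d is understood inside the character class.
parts <- strsplit(pct, "[^\\d.]+", perl = TRUE)
print(parts)
```

For the bracket-only value, the leading "[" produces an empty first piece, which is presumably why pct_patients ends up as NA after conversion and the number lands in pct_ears.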