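Creating a tidy dataset from an untidy one with two variables per column and implicit missing values

Time: 2016-10-25 23:07:07

Tags: r regex tidyr

I have an untidy dataset that combines two variables (some of them missing) in each of its two columns. A small subsample is in the data frame below (named `untidy` here and `test` in the answers). I am struggling to produce the desired tidy dataset shown after it.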

untidy <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", 
"906 [1685]"), `% Otorrhea` = c("58.61%", "13.30%", "11.11%", 
"52.38%", "14.79% [10.45%]")), .Names = c("N [ears]", "% Otorrhea"
), row.names = c(NA, 5L), class = "data.frame")

Desired data frame

N_patients  N_ears  pct_patients  pct_ears
173         NA      58.61         NA
 60         NA      13.30         NA
 54         96      11.11         NA
168        328      52.38         NA
906       1685      14.79      10.45

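Thanks!

There always seems to be an edge case: neither answer handles the problem in row 5 below. It looks like just a regex issue. Any suggestions on how to fix it?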

untidy_2 <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", "906 [1685]"),
                           `% Otorrhea` = c("58.61%", "13.30%", "11.11%", "52.38%", "14.79% [10.45%]")),
                      .Names = c("N [ears]", "% Otorrhea"),
                      row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

That is, in row 5, [35.33%] is parsed as pct_patients:

   N [ears] % Otorrhea N_patients N_ears pct_patients pct_ears
1       173     58.61%        173     NA        58.61       NA
2        60     13.30%         60     NA        13.30       NA
3   54 [96]     11.11%         54     96        11.11       NA
4 168 [328]     52.38%        168    328        52.38       NA
5  75 [150]   [35.33%]         75    150        35.33       NA
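
To see why row 5 slips through, here is a minimal sketch that uses base R's `strsplit()` as a stand-in for the splitting that `separate()` performs (an approximation for illustration, not tidyr's actual implementation). The separator `\\s\\[` used in the first answer requires whitespace before the bracket, so a value that begins with `[` is never split and stays in the first column:

```r
# Separator taken from the first answer's separate() calls
sep <- "\\s\\["

# A value with both parts splits into two pieces
ok <- strsplit("14.79% [10.45%]", sep, perl = TRUE)[[1]]
print(ok)    # "14.79%" "10.45%]"

# The edge case has no whitespace before "[", so nothing is split
edge <- strsplit("[35.33%]", sep, perl = TRUE)[[1]]
print(edge)  # "[35.33%]"
```

With only one piece available, `fill = "right"` puts it in the first column, and the number is then read as a patient percentage.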

3 answers:

Answer 0 (score: 2)

Happily, this is very easy with tidyr, which is part of the tidyverse.

library(tidyverse)

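# Rebuild the sample data from the question
test <- structure(list(`N [ears]` = c("173", "60", "54 [96]", "168 [328]", "906 [1685]"), 
                       `% Otorrhea` = c("58.61%", "13.30%", "11.11%", "52.38%", "14.79% [10.45%]")), 
                  .Names = c("N [ears]", "% Otorrhea"), 
                  row.names = c(NA, 5L), class = "data.frame")

test %>% 
    # Split each column at the " [" that precedes the bracketed value;
    # rows without a bracket get NA in the second column (fill = "right")
    separate(`N [ears]`, into = c("N_patients", "N_ears"), sep = "\\s\\[", fill = "right") %>%
    separate(`% Otorrhea`, into = c("pct_patients", "pct_ears"), sep = "\\s\\[", fill = "right") %>%
    # Strip the leftover "%" and "]" and convert every column to numeric.
    # mutate_each() and funs() are deprecated in current dplyr;
    # mutate(across(everything(), parse_number)) is the newer spelling.
    mutate_each(funs(parse_number))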
#>   N_patients N_ears pct_patients pct_ears
#> 1        173     NA        58.61       NA
#> 2         60     NA        13.30       NA
#> 3         54     96        11.11       NA
#> 4        168    328        52.38       NA
#> 5        906   1685        14.79    10.45

Answer 1 (score: 1)

Here is an alternative approach that uses the extract() function with a regular expression:

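library(tidyr)
test %>% 
    # First group: patient count; optional second group: ear count in brackets
    extract(`N [ears]`, into = c("N_patients", "N_ears"), 
            regex = "^(\\d+)(?:\\s\\[(\\d+)\\])?$") %>% 
    # Same idea for the percentages, dropping the "%" signs
    extract(`% Otorrhea`, into = c("pct_patients", "pct_ears"), 
            regex = "^([.0-9]+)%(?:\\s\\[([.0-9]+)%\\])?$")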

#  N_patients N_ears pct_patients pct_ears
#1        173   <NA>        58.61     <NA>
#2         60   <NA>        13.30     <NA>
#3         54     96        11.11     <NA>
#4        168    328        52.38     <NA>
#5        906   1685        14.79    10.45

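Here, the optional non-capturing group (?:...)? matches the bracketed ears value when it is present, so only the number inside the brackets is captured and rows without it get NA.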
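
The same idea can be stretched to cover the edge case from the question, where only the bracketed value is present. The following is a sketch in base R, not part of the original answer: the helper name `parse_pct` and the modified pattern are assumptions made for illustration, and it assumes that a bracket-only value such as `[35.33%]` should count as an ears percentage. Both parts of the pattern are made optional, and `sub()` pulls out each capture group (an unmatched group yields an empty string, which `as.numeric()` turns into NA).

```r
# Hypothetical helper: split a percentage string into patient and ear values
parse_pct <- function(x) {
  # Both parts are optional: "<patients>%" and "[<ears>%]"
  pattern <- "^(?:([.0-9]+)%)?\\s*(?:\\[([.0-9]+)%\\])?$"
  data.frame(
    pct_patients = as.numeric(sub(pattern, "\\1", x, perl = TRUE)),
    pct_ears     = as.numeric(sub(pattern, "\\2", x, perl = TRUE))
  )
}

parsed <- parse_pct(c("58.61%", "14.79% [10.45%]", "[35.33%]"))
print(parsed)
```

Strings that do not match the pattern at all are returned unchanged by `sub()` and become NA (with a warning) in both columns, so unexpected formats are worth checking separately.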

Answer 2 (score: 0)

The best answer for my actual dataset came from a comment by alistaire: https://stackoverflow.com/users/4497050/alistaire

It is shown below, wrapped in a simple function.

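library(tidyverse)

# separate_() is the deprecated standard-evaluation form of separate()
make_tidy <- function(untidy){
    untidy %>% 
        # Column 1: the default separator splits on any run of non-alphanumeric characters
        separate_(colnames(untidy)[1], c('N_patients', 'N_ears'), fill = 'right', extra = 'drop', convert = TRUE) %>% 
        # Column 2: split on any run of characters that are neither digits nor "."
        separate_(colnames(untidy)[2], c('pct_patients', 'pct_ears'), sep = '[^\\d.]+', extra = 'drop', convert = TRUE)
}

tidy_2 <- make_tidy(untidy_2)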

This parses untidy_2 correctly:

> tidy_2
# A tibble: 5 × 4
  N_patients N_ears pct_patients pct_ears
*      <int>  <int>        <dbl>    <dbl>
1        173     NA        58.61       NA
2         60     NA        13.30       NA
3         54     96        11.11       NA
4        168    328        52.38       NA
5        906   1685        14.79    10.45
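
As a rough illustration of why the separator `[^\\d.]+` copes with a bracket-only value, here is a sketch with base R's `strsplit()`. It only approximates what `separate_()` does internally, and the handling of empty trailing pieces may differ between the two. The bracket-only string `[35.33%]` is taken from the edge case in the question.

```r
# Split on any run of characters that are neither digits nor "."
sep <- "[^\\d.]+"

pieces <- strsplit(c("58.61%", "14.79% [10.45%]", "[35.33%]"), sep, perl = TRUE)
print(pieces)
# [[1]] "58.61"
# [[2]] "14.79" "10.45"
# [[3]] ""      "35.33"
```

For the bracket-only value, the leading `[` acts as a separator, so the first piece is empty and the number lands in the second position, which corresponds to pct_ears.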