Dplyr transformation based on string filtering and conditions

Time: 2020-03-06 20:06:56

Tags: r dplyr tidyverse data-manipulation

I want to transform a messy dataset in R, but I am stuck on how to approach it. Below are a sample dataset and the result I need to produce:

dataset <- tribble(
  ~ID, ~DESC,
  1, "3+1Â 81Â mÂ", 
  2, "2+1Â 90Â mÂ",
  3, "3+KK 28Â mÂ",
  4, "3+1 120 m (Mezone)")
dataset

dataset_tranformed <- tribble(
  ~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
  1, 4, 81,0, 0,
  2, 3, 90,0,0,
  3, 3, 28,0,1,
  4, 4, 120,1, 0)
dataset_tranformed

The column has to be split first, but dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work, because the (Mezone) part is dropped.
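To see why, a small base-R sketch can count the space-separated pieces in each row. It assumes the separators are ordinary spaces, and it writes the stray "Â" character as the escape \u00c2 so the script does not depend on the file encoding; the names desc and pieces are only illustrative. Given three column names, separate() keeps the first three pieces and by default drops any extras with a warning, so the fourth piece of row 4 is lost.

```r
# Descriptions copied from the question; "\u00c2" stands for the stray "Â".
# Assumption: the separators are ordinary spaces.
desc <- c("3+1\u00c2 81\u00c2 m\u00c2",
          "2+1\u00c2 90\u00c2 m\u00c2",
          "3+KK 28\u00c2 m\u00c2",
          "3+1 120 m (Mezone)")

# Split every description on a literal space.
pieces <- strsplit(desc, " ", fixed = TRUE)

# Number of pieces per row: only the last row has a fourth piece.
lengths(pieces)

# The last piece of each row; for row 4 this is the part that gets dropped.
vapply(pieces, function(p) p[length(p)], character(1))
```

tidyr::separate() also accepts extra = "merge", which keeps the remainder in the last column instead of dropping it, but the flag would still have to be derived from that text afterwards.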

1 answer:

Answer 0: (score: 2)

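We can do this by extracting each component separately and evaluating the room count (for example "3+1") as an expression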

library(dplyr)
library(stringr)
library(tidyr)
library(purrr)  # provides map_dbl()
dataset %>% 
   mutate(Rooms = map_dbl(DESC,  ~
       str_extract(.x, "^\\d+\\+\\d*") %>% 
         str_replace("\\+$", "+0") %>% 
         rlang::parse_expr(.) %>% 
         eval ), 
   Meters = str_extract(DESC, "(?<=\\s)\\d+"),  # first digit run after whitespace; stays character unless wrapped in as.numeric()
   Mezone = +(str_detect(DESC, "Mezone")),
   KK = +(str_detect(DESC, "KK"))) %>%
  select(-DESC)
# A tibble: 4 x 5
#     ID Rooms Meters Mezone    KK
#  <dbl> <dbl> <chr>   <int> <int>
#1     1     4 81          0     0
#2     2     3 90          0     0
#3     3     3 28          0     1
#4     4     4 120         1     0
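The approach above parses text such as "3+1" and evaluates it as R code. As a hedged alternative, the following base-R sketch gets the same numbers from regular-expression capture groups and plain addition, with no evaluation and no extra packages. It is not the answerer's method. The names desc, pattern, rooms1, rooms2, meters and result are illustrative, the stray "Â" is written as the escape \u00c2, and the pattern assumes every description starts with digits, a plus sign, and later a second run of digits for the area.

```r
# Descriptions copied from the question; "\u00c2" stands for the stray "Â".
desc <- c("3+1\u00c2 81\u00c2 m\u00c2",
          "2+1\u00c2 90\u00c2 m\u00c2",
          "3+KK 28\u00c2 m\u00c2",
          "3+1 120 m (Mezone)")

# Three capture groups: digits before "+", optional digits right after "+",
# then the next run of digits (the floor area).
pattern <- "^([0-9]+)\\+([0-9]*)[^0-9]+([0-9]+).*$"

rooms1 <- as.numeric(sub(pattern, "\\1", desc))
rooms2 <- sub(pattern, "\\2", desc)
rooms2[rooms2 == ""] <- "0"   # "3+KK" has no second number
meters <- as.numeric(sub(pattern, "\\3", desc))

result <- data.frame(
  ID     = 1:4,
  Rooms  = rooms1 + as.numeric(rooms2),
  Meters = meters,
  Mezone = as.integer(grepl("Mezone", desc, fixed = TRUE)),
  KK     = as.integer(grepl("KK", desc, fixed = TRUE))
)
result
```

Here Meters comes out numeric, matching the desired result in the question, and the KK flag is detected from the text itself.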

Another option is to use extract and then str_detect

dataset %>% 
   extract(DESC, into = c("Rooms1", "Rooms2", "Meters"), 
     "^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
   transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
        KK = +(is.na(Rooms2)), Rooms =  Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
#     ID Mezone    KK Rooms Meters
#  <dbl>  <int> <int> <dbl>  <int>
#1     1      0     0     4     81
#2     2      0     0     3     90
#3     3      0     1     3     28
#4     4      1     0     4    120