Mapping data by pattern

Time: 2020-08-19 19:44:10

Tags: r dplyr

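I have data in long format and want to convert it to wide format. The column-mapping logic is: the first column must contain the word "bed", the second column must contain "m²", and the third column must contain the word "floor" or "lift".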

Type <- read.table(header = TRUE, text = "
    Attributes
    '2 bed'
    '197 m²'
    'Floor 5 exterior with lift'
    '3 bed'
    'Ground floor exterior with lift'
    '3 bed'
    '110 m²'
    '195 m²'
    'Floor 5 exterior with lift'
    '3 bed'
    '110 m²'
    '5 bed'
    ")


library(dplyr)
library(stringr)
library(tidyr)

Type2 <- Type %>%
  group_by(grp = cumsum(str_detect(Attributes, '^\\d+\\s*bed$'))) %>% 
  mutate(colnm = c('BedRoom', 'Size', 'Floor')[row_number()]) %>%
  ungroup %>%
  pivot_wider(names_from = colnm, values_from = Attributes) %>%
  select(-grp)

The code above does not work when the "bed" value is missing.
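To make the failure concrete, here is a minimal base R sketch of what the attempt above does internally. It assumes the same twelve values as `Type$Attributes`, typed out as a plain vector; `attrs`, `pos` and the use of `ave()` in place of `row_number()` are illustrative choices, not part of the original code.

```r
# Same values as Type$Attributes, written out so the sketch is self-contained.
attrs <- c("2 bed", "197 m\u00b2", "Floor 5 exterior with lift",
           "3 bed", "Ground floor exterior with lift",
           "3 bed", "110 m\u00b2", "195 m\u00b2", "Floor 5 exterior with lift",
           "3 bed", "110 m\u00b2", "5 bed")

# A new group starts only at a row that looks like "<n> bed".
grp <- cumsum(grepl("^\\d+\\s*bed$", attrs))

# Position of each row inside its group (base R stand-in for row_number()).
pos <- ave(seq_along(attrs), grp, FUN = seq_along)

# Column names are assigned purely by position, not by content.
colnm <- c("BedRoom", "Size", "Floor")[pos]

print(data.frame(attrs, grp, pos, colnm))
```

Two things go wrong. "195 m²" has no "bed" row before it, so it stays in the third group and pushes that group to four rows, leaving the last one with an `NA` column name. And because names depend only on position, "Ground floor exterior with lift" is labelled `Size` when the "m²" row is absent.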

Desired output:

[image: desired output table]

1 answer:

Answer 0 (score: 1)

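One option is to create an index that maps each pattern described in the post, using case_when/str_detect. Based on that index, a new record starts wherever the index repeats or goes down, that is, wherever the difference between adjacent index values is less than or equal to 0. Taking the cumulative sum of that logical vector gives a grouping column. With 'grp' in place, pivot_wider reshapes the data directly to 'wide' format.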
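library(stringr)
library(dplyr)
library(tidyr)

Type %>%
  mutate(
    # 1 = bedroom count, 2 = size, 3 = floor/lift description
    ind = case_when(
      str_detect(Attributes, "\\bbed") ~ 1,
      str_detect(Attributes, "m²$") ~ 2,
      # case-insensitive so a lowercase "floor" without "lift" is matched too
      str_detect(Attributes, regex("\\b(floor|lift)\\b", ignore_case = TRUE)) ~ 3
    ),
    # a new record starts whenever the index repeats or goes down
    grp = cumsum(c(TRUE, diff(ind) <= 0)),
    colnm = c("BedRoom", "Size", "Floor")[ind]
  ) %>%
  select(-ind) %>%
  pivot_wider(names_from = colnm, values_from = Attributes) %>%
  select(-grp)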
# A tibble: 6 x 3
#  BedRoom Size   Floor                          
#  <chr>   <chr>  <chr>                          
#1 2 bed   197 m² Floor 5 exterior with lift     
#2 3 bed   <NA>   Ground floor exterior with lift
#3 3 bed   110 m² <NA>                           
#4 <NA>    195 m² Floor 5 exterior with lift     
#5 3 bed   110 m² <NA>                           
#6 5 bed   <NA>   <NA>
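
To see why the grouping works, it can help to look at the intermediate `ind` and `grp` values before pivoting. The sketch below recomputes them in base R on the same twelve values; the nested `ifelse()` calls stand in for `case_when()`, and the vector name `attrs` is an illustrative assumption rather than something from the answer.

```r
# Same values as Type$Attributes, written out so the sketch is self-contained.
attrs <- c("2 bed", "197 m\u00b2", "Floor 5 exterior with lift",
           "3 bed", "Ground floor exterior with lift",
           "3 bed", "110 m\u00b2", "195 m\u00b2", "Floor 5 exterior with lift",
           "3 bed", "110 m\u00b2", "5 bed")

# Pattern index: 1 = bed, 2 = size, 3 = floor/lift, NA if nothing matches.
ind <- ifelse(grepl("\\bbed", attrs), 1L,
       ifelse(grepl("m\u00b2$", attrs), 2L,
       ifelse(grepl("\\b(Floor|lift)\\b", attrs), 3L, NA_integer_)))

# A non-positive step in the index marks the first row of a new record.
grp <- cumsum(c(TRUE, diff(ind) <= 0))

print(data.frame(attrs, ind, grp))
```

Here `ind` runs 1, 2, 3, 1, 3, 1, 2, 2, 3, 1, 2, 1, so "195 m²" (a repeated 2) opens its own group even though no "bed" row precedes it. Note that a row matching none of the patterns would give `NA` in `ind`, and `cumsum()` would then propagate `NA` through the rest of `grp`, so such rows need to be filtered or handled first.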