Using R's dplyr to classify sizes from multiple formats into small, medium, large

Time: 2019-03-01 23:01:11

Tags: r string dplyr

Here is a sample dataset:

library(tidyverse)

df <- tibble(
  size = c("l", "L/Black", "medium", "small", "large", "L/White", "s", 
       "L/White", "M", "S/Blue", "M/White", "L/Navy", "M/Navy", "S"),
  shirt = c("blue", "black", "black", "black", "white", "white", "purple",
        "white", "purple", "blue", "white", "navy", "navy", "navy")
)

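The dataset above has a column, size, that contains the basic values: small, medium, large. But it also contains other representations of those sizes, such as M, S/Blue, or s.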

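I would like the most efficient way to turn every value into small, medium, or large, and to get rid of the colors in the size column. For example, L/Black should become large.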

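I could do this with repeated calls to gsub, but I am wondering whether there is a more efficient way than my initial idea. My dataset has several thousand rows, and the code example below is awful: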

df$size <- df$size %>%
 gsub("M", "medium", .) %>%
 gsub("mediumedium", "medium", .) %>%
 gsub("S", "small", .) %>%
 gsub("smallmall", "small", .) %>%
 gsub("L", "large", .) %>%
 gsub("S/Blue", "small", .) %>%
 gsub("L/Navy", "large", .) 

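This approach does not work well, because running the first couple of gsub calls above introduces things like mediumedium and smallmall. What is the best way to standardize all three main sizes?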

2 answers:

Answer 0: (score: 1)

library("tidyverse")

df %>%
  # Extract the alphanum substring at the start of "size"
  extract(size, "size2", regex = "^(\\w*)", remove = FALSE) %>%
  # All lowercase in case there are sizes like "Small"
  # And then recode as required.
  # Here "l" = "large" means take all occurrences of "l" and
  # recode them as "large", etc.
  mutate(size3 = recode(tolower(size2),
                        "l" = "large",
                        "m" = "medium",
                        "s" = "small"))
# # A tibble: 14 x 4
#   size    size2  shirt  size3
#   <chr>   <chr>  <chr>  <chr>
# 1 l       l      blue   large
# 2 L/Black L      black  large
# 3 medium  medium black  medium
# 4 small   small  black  small
# 5 large   large  white  large

Of course, you don't need three size columns. I used different column names so that it is clear what each transformation achieves.
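If only one column is wanted, the same steps can be collapsed into a single mutate. The following is a minimal sketch rather than the answer's own code: it assumes stringr's str_extract in place of tidyr's extract, and the name df_clean is made up for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Sample data from the question
df <- tibble(
  size = c("l", "L/Black", "medium", "small", "large", "L/White", "s",
           "L/White", "M", "S/Blue", "M/White", "L/Navy", "M/Navy", "S"),
  shirt = c("blue", "black", "black", "black", "white", "white", "purple",
            "white", "purple", "blue", "white", "navy", "navy", "navy")
)

# df_clean is a hypothetical name; size is overwritten in place
df_clean <- df %>%
  mutate(
    size = size %>%
      str_extract("^\\w+") %>%  # leading word characters, so "/Black" etc. is dropped
      tolower() %>%             # "L" and "l" are treated alike
      recode("l" = "large", "m" = "medium", "s" = "small")
  )
```

Values that recode does not match, such as "medium", pass through unchanged, so after this step the column should only contain the three full size names for this sample.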

Answer 1: (score: 1)

A solution using tidyverse.

library(tidyverse)

df2 <- df %>%
  # Remove color
  mutate(size = map2_chr(size, shirt, ~str_replace(.x, fixed(.y, ignore_case = TRUE), ""))) %>%
  # Remove /
  mutate(size = str_replace(size, fixed("/"), "")) %>%
  # Replacement
  mutate(size = case_when(
    size %in% c("l", "L") ~ "large",
    size %in% c("m", "M") ~ "medium",
    size %in% c("s", "S") ~ "small",
    TRUE                  ~ size
  ))
df2
# # A tibble: 14 x 2
#    size   shirt 
#    <chr>  <chr> 
#  1 large  blue  
#  2 large  black 
#  3 medium black 
#  4 small  black 
#  5 large  white 
#  6 large  white 
#  7 small  purple
#  8 large  white 
#  9 medium purple
# 10 small  blue  
# 11 medium white 
# 12 large  navy  
# 13 medium navy  
# 14 small  navy
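
As a different technique from the two answers, a base R lookup on the first letter might also be enough for data shaped like this sample. This sketch assumes every value starts with s, m, or l in either case; the names lookup, first_letter, and size_clean are illustrative only.

```r
# Size column from the question's sample data
size <- c("l", "L/Black", "medium", "small", "large", "L/White", "s",
          "L/White", "M", "S/Blue", "M/White", "L/Navy", "M/Navy", "S")

# Named vector mapping the first letter to the full size name
lookup <- c(s = "small", m = "medium", l = "large")

# Lowercased first character of each value, e.g. "L/Black" gives "l"
first_letter <- tolower(substr(size, 1, 1))

# Index the lookup by name; unname() drops the s/m/l names from the result
size_clean <- unname(lookup[first_letter])
```

A value that starts with any other letter would come back as NA, which makes unexpected entries easy to spot, but it also means this shortcut is only safe when the assumption about the first letter holds.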