Create a new column from a factor column using multiple conditions

Time: 2018-07-26 14:44:18

Tags: r regex dplyr gsub grepl

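I want to create a new column from an existing column that contains many factor levels, where part of each level name recurs. Let me illustrate with an example: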

factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
df <- data.frame(factorA)

This is my attempt:

library(dplyr)
df <- mutate(
  df,
  factorB = case_when(
    matches(factorA, "paul.") ~ "paul",
    matches(factorA, "george.") ~ "george",
    matches(factorA, "john.") ~ "john",
    matches(factorA, "ringo.") ~ "ringo",
    TRUE ~ "NA"))

This gives me: Error in mutate_impl(.data, dots) : Evaluation error: is_string(match) is not TRUE. I think this happens because I am not specifying correctly how R should look for the string fragment I want.

The result should look like this:

           factorA  factorB
1    paul173643738  paul
2       paul827484  paul 
3   george39585496  george
4 george7848658946  george
5      john2354674  john
6          john346  john
7      ringo384934  ringo
8       ringo24653  ringo

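I am sure this question has been asked before, but I could not find an answer that fits my needs. Any help would be greatly appreciated.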

5 answers:

Answer 0: (score: 1)

Using stringr:

library(dplyr)    # provides %>%, mutate() and case_when()
library(stringr)  # provides str_detect()

df %>%
  mutate(factorB = case_when(
    # each pattern is the name followed by at least one more character
    str_detect(factorA, "paul.") ~ "paul",
    str_detect(factorA, "george.") ~ "george",
    str_detect(factorA, "john.") ~ "john",
    str_detect(factorA, "ringo.") ~ "ringo"
  ))
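
The error in the question arises because matches() in dplyr is a helper for selecting columns by name, not for testing the values inside a column. As a rough sketch, base R grepl can take the place of str_detect inside case_when. The ^ anchors and the NA_character_ fallback are additions for this illustration, assuming every name sits at the start of the string:

```r
library(dplyr)

factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA, stringsAsFactors = FALSE)

# grepl() returns one TRUE/FALSE per row, which is what case_when() expects
df <- mutate(
  df,
  factorB = case_when(
    grepl("^paul", factorA)   ~ "paul",
    grepl("^george", factorA) ~ "george",
    grepl("^john", factorA)   ~ "john",
    grepl("^ringo", factorA)  ~ "ringo",
    TRUE                      ~ NA_character_  # a real missing value, not the string "NA"
  )
)
```

Using NA_character_ rather than "NA" keeps unmatched rows detectable with is.na().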

Answer 1: (score: 1)

You can use stringr::str_detect:

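library(tidyverse)
factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
# as_data_frame() is deprecated in current tibble releases; as_tibble() is its replacement.
# The resulting column is named `value`, so `factorA` inside mutate() refers to
# the vector defined above rather than to a column of df.
df <- as_data_frame(factorA)
df %>% 
  mutate(factorB = case_when(
    str_detect(factorA, "paul") ~ "paul",
    str_detect(factorA, "george") ~ "george",
    str_detect(factorA, "john") ~ "john",
    str_detect(factorA, "ringo") ~ "ringo"
  ))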
#> # A tibble: 8 x 2
#>   value            factorB
#>   <chr>            <chr>  
#> 1 paul173643738    paul   
#> 2 paul827484       paul   
#> 3 george39585496   george 
#> 4 george7848658946 george 
#> 5 john2354674      john   
#> 6 john346          john   
#> 7 ringo384934      ringo  
#> 8 ringo24653       ringo

Answer 2: (score: 1)

If the strings in factorA follow a fixed format, you can use gsub to extract the names:

only_names <- gsub('(^[A-Za-z]*).*', '\\1', factorA)
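
To tie this back to the data frame from the question, the extracted names can be assigned straight to a new column. The following is a minimal sketch, assuming every value begins with the name in letters:

```r
factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA, stringsAsFactors = FALSE)

# Capture the leading run of letters (group 1) and drop everything after it
df$factorB <- gsub("(^[A-Za-z]*).*", "\\1", df$factorA)
```

A value with no leading letters would come back as an empty string, so this approach depends on the fixed format mentioned above.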

Answer 3: (score: 1)

Using base R sub and a regular expression:

> data.frame(factorA, factorB=sub("\\d+", "", factorA))
           factorA factorB
1    paul173643738    paul
2       paul827484    paul
3   george39585496  george
4 george7848658946  george
5      john2354674    john
6          john346    john
7      ringo384934   ringo
8       ringo24653   ringo
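
One detail worth keeping in mind: sub replaces only the first match, whereas gsub replaces every match. The sketch below uses a hypothetical mixed value ("paul12abc34", not part of the original data) to show the difference:

```r
# The second value is a hypothetical input added for illustration
x <- c("paul173643738", "paul12abc34")

first_only <- sub("\\d+", "", x)   # removes only the first run of digits
all_runs   <- gsub("\\d+", "", x)  # removes every run of digits
```

For the data in the question each value contains a single run of digits, so both calls give the same result.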

Answer 4: (score: 0)

Try extract with a regular expression that matches only letters.

library(dplyr)  # provides %>%
library(tidyr)  # provides extract()

my.regex <- "([a-z]+)"

df %>% 
  extract(factorA, 
          into = "factorB", 
          regex = my.regex,
          remove = FALSE)

#>            factorA factorB
#> 1    paul173643738    paul
#> 2       paul827484    paul
#> 3   george39585496  george
#> 4 george7848658946  george
#> 5      john2354674    john
#> 6          john346    john
#> 7      ringo384934   ringo
#> 8       ringo24653   ringo

In general, though, I would go for cleaner data, with the names and the discrete values in separate columns...

my.regex <- "([a-z]+)([0-9]+)"

df %>% 
  extract(factorA, 
          into = c("factorA", "factorB"), 
          regex = my.regex,
          remove = FALSE)

#>   factorA    factorB
#> 1    paul  173643738
#> 2    paul     827484
#> 3  george   39585496
#> 4  george 7848658946
#> 5    john    2354674
#> 6    john        346
#> 7   ringo     384934
#> 8   ringo      24653

Created on 2018-07-28 by the reprex package (v0.2.0).
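
As a variation on the two-group split in this answer, tidyr's extract also accepts convert = TRUE, which runs type conversion on the new columns so the digits end up as numbers rather than text. This is only a sketch: the column names name and id are hypothetical choices for the illustration, not taken from the answer.

```r
library(tidyr)

factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA, stringsAsFactors = FALSE)

# `name` and `id` are hypothetical column names chosen for this sketch
res <- tidyr::extract(df, factorA,
                      into = c("name", "id"),
                      regex = "([a-z]+)([0-9]+)",
                      remove = FALSE,   # keep the original factorA column
                      convert = TRUE)   # turn the captured digits into a numeric column
```

Keeping the original column and using fresh names for the two parts avoids any clash between an existing column and a newly extracted one.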