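I want to create a new column from an existing column that contains several factor levels, where part of each level name recurs. Let me illustrate with an example: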
factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
df <- data.frame(factorA)
This is my attempt:
library(dplyr)
df <- mutate(
df,factorB = case_when(
matches(factorA,"paul.") ~ "paul",
matches(factorA,"george.") ~ "george",
matches(factorA,"john.") ~ "john",
matches(factorA,"ringo.") ~ "ringo",
TRUE ~ "NA"))
This gives me the error: Error in mutate_impl(.data, dots) : Evaluation error: is_string(match) is not TRUE.
I think this happens because I have not correctly told R how to look for the string fragment I want.
The result should look like this:
factorA factorB
1 paul173643738 paul
2 paul827484 paul
3 george39585496 george
4 george7848658946 george
5 john2354674 john
6 john346 john
7 ringo384934 ringo
8 ringo24653 ringo
I'm sure this has been asked before, but I could not find an answer that fits my needs. Any help would be much appreciated.
Answer 0 (score: 1)
Using stringr:
library(dplyr)    # mutate(), case_when(), %>%
library(stringr)  # str_detect()
df %>%
  mutate(factorB = case_when(
    # each condition tests whether the name occurs anywhere in factorA
    str_detect(factorA, "paul") ~ "paul",
    str_detect(factorA, "george") ~ "george",
    str_detect(factorA, "john") ~ "john",
    str_detect(factorA, "ringo") ~ "ringo"
  ))
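For context on the error in the question: matches() is a column-selection helper (it picks columns whose names match a pattern), and its first argument must be a single pattern string, which is why passing the whole factorA column fails the is_string(match) check. Inside case_when() you need a function that tests each value, such as str_detect() or base R's grepl(). The following is a minimal base R sketch of the same idea, not a drop-in replacement for the dplyr code; the known_names vector is a hypothetical helper introduced only for this illustration.

```r
factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA, stringsAsFactors = FALSE)

# Hypothetical helper: the names we expect to find inside factorA.
known_names <- c("paul", "george", "john", "ringo")

# Start with NA everywhere, then fill in the first name that matches each row.
df$factorB <- NA_character_
for (nm in known_names) {
  # grepl() returns one TRUE/FALSE per row; fixed = TRUE treats nm literally.
  hit <- is.na(df$factorB) & grepl(nm, df$factorA, fixed = TRUE)
  df$factorB[hit] <- nm
}

print(df)
```

Rows that match none of the names keep NA, which mirrors what case_when() returns when no condition is TRUE.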
Answer 1 (score: 1)
You can use stringr::str_detect:
library(tidyverse)
factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
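# tibble() keeps the column name factorA; the deprecated as_data_frame() would name it value,
# and the str_detect() calls below would then silently use the global vector instead of the column
df <- tibble(factorA = factorA)
df %>%
mutate(factorB = case_when(
str_detect(factorA, "paul") ~ "paul",
str_detect(factorA, "george") ~ "george",
str_detect(factorA, "john") ~ "john",
str_detect(factorA, "ringo") ~ "ringo"
))
#> # A tibble: 8 x 2
#> factorA factorB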
#> <chr> <chr>
#> 1 paul173643738 paul
#> 2 paul827484 paul
#> 3 george39585496 george
#> 4 george7848658946 george
#> 5 john2354674 john
#> 6 john346 john
#> 7 ringo384934 ringo
#> 8 ringo24653 ringo
Answer 2 (score: 1)
If the strings in factorA always follow the same format (letters first, then digits), you can use gsub to extract the names:
only_names <- gsub('(^[A-Za-z]*).*', '\\1', factorA)
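The line above yields a standalone character vector. As a small sketch of how it could be attached to the data frame from the question as factorB (assuming every value starts with the name and that the name consists only of letters), one might write:

```r
factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA, stringsAsFactors = FALSE)

# Capture the leading run of letters and replace the whole string with it.
# sub() is enough here because the pattern consumes the entire string once.
df$factorB <- sub("^([A-Za-z]*).*$", "\\1", df$factorA)

print(df)
```

Note that a value with no leading letters would produce an empty string rather than NA, so this approach relies on the format really being fixed.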
Answer 3 (score: 1)
Using base R's sub and a regular expression:
> data.frame(factorA, factorB = sub("\\d+", "", factorA))
factorA factorB
1 paul173643738 paul
2 paul827484 paul
3 george39585496 george
4 george7848658946 george
5 john2354674 john
6 john346 john
7 ringo384934 ringo
8 ringo24653 ringo
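One caveat worth illustrating: sub() replaces only the first match, so "\\d+" removes just the first run of digits. That is sufficient for the data in the question, but it behaves differently if digits can appear more than once. The sketch below compares three variants; the value "paul12x34" is a made-up input used only to show the difference.

```r
# "john346" is from the question; "paul12x34" is a hypothetical edge case.
x <- c("john346", "paul12x34")

first_run_removed <- sub("\\d+", "", x)     # drops only the first run of digits
all_runs_removed  <- gsub("\\d+", "", x)    # drops every run of digits
cut_at_digit      <- sub("\\d.*$", "", x)   # drops everything from the first digit on

print(data.frame(x, first_run_removed, all_runs_removed, cut_at_digit))
```

Which variant is appropriate depends on what the trailing part of the identifier can contain; for strictly "name followed by digits" values all three give the same result.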
Answer 4 (score: 0)
Try tidyr's extract with a regular expression that matches only letters.
library(tidyr)  # extract()
library(dplyr)  # %>%
my.regex <- "([a-z]+)"  # one capture group: the leading lowercase letters
df %>%
extract(factorA,
into = "factorB",
regex = my.regex,
remove = FALSE)
#> factorA factorB
#> 1 paul173643738 paul
#> 2 paul827484 paul
#> 3 george39585496 george
#> 4 george7848658946 george
#> 5 john2354674 john
#> 6 john346 john
#> 7 ringo384934 ringo
#> 8 ringo24653 ringo
In general, though, I would aim for cleaner data, with the names and the numeric values in separate columns:
my.regex <- "([a-z]+)([0-9]+)"
df %>%
extract(factorA,
into = c("factorA", "factorB"),
regex = my.regex,
remove = FALSE)
#> factorA factorB
#> 1 paul 173643738
#> 2 paul 827484
#> 3 george 39585496
#> 4 george 7848658946
#> 5 john 2354674
#> 6 john 346
#> 7 ringo 384934
#> 8 ringo 24653
Created on 2018-07-28 by the reprex package (v0.2.0).