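Regular expression with multiple groups

Date: 2019-09-05 03:43:42

Tags: r regex

Text-mining newbie here. I am trying to extract a leading code from a column and update the column with the rest of the text. I tried using str_extract, but I can't seem to get the regex syntax right. Could someone show me how? Thanks!

Reproducible data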

data.frame("name" = c("D1. Hi my name", "A3.3. Hello this is"), "Amount" = c(1, 4))

                 name Amount
1      D1. Hi my name      1
2 A3.3. Hello this is      4

Expected output

                 name Amount      New Name Extracted
1      D1. Hi my name      1    Hi my name       D1.
2 A3.3. Hello this is      4 Hello this is     A3.3.

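3 answers:

Answer 0 (score: 2)

We can use extract from tidyr. Here the pattern captures the leading run of non-whitespace characters (\\S+) as the first group, matches the single space that follows, and captures the remaining characters (.*) as the second group.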

library(tidyverse)
# df2 is the data frame from the question
df2 %>% 
    extract(name, into = c("Extracted", "NewName"), "^(\\S+) (.*)", 
             remove = FALSE) %>%
     # put the original columns first, then the two new ones
     select(names(df2), NewName, Extracted)
#                 name Amount       NewName Extracted
#1      D1. Hi my name      1    Hi my name       D1.
#2 A3.3. Hello this is      4 Hello this is     A3.3.

Or, in base R, we can use sub to turn the first whitespace character into a comma delimiter and then read the result with read.csv

cbind(df2, read.csv(text = sub("\\s", ",", df2$name), 
           header = FALSE, col.names = c("Extracted", "NewName")))
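
The two capture groups described above can also be applied directly in base R with sub and backreferences. This is only a minimal sketch: the data frame name df is a placeholder for the question's data, and it assumes every name contains at least one space.

```r
# Hypothetical data frame matching the question's example
df <- data.frame(name = c("D1. Hi my name", "A3.3. Hello this is"),
                 Amount = c(1, 4),
                 stringsAsFactors = FALSE)

# Group 1: leading run of non-whitespace characters
# Group 2: everything after the first space
pattern <- "^(\\S+) (.*)$"

# "\\2" and "\\1" are backreferences to the captured groups
df$NewName   <- sub(pattern, "\\2", df$name, perl = TRUE)
df$Extracted <- sub(pattern, "\\1", df$name, perl = TRUE)

print(df)
```

If a name contains no space at all, the pattern does not match and sub returns the string unchanged, so both new columns would hold the full name for that row.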

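Answer 1 (score: 1)

Based on the example shown, we can extract a capital letter followed by digits and dots to get Extracted, and remove that same part from name to get New_Name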

library(dplyr)
library(stringr)

df %>%
  mutate(Extracted = str_extract(name, "[A-Z]\\d\\.?\\d?\\."), 
         New_Name = str_remove(name, Extracted))

#                 name Amount Extracted       New_Name
#1      D1. Hi my name      1       D1.     Hi my name
#2 A3.3. Hello this is      4     A3.3.  Hello this is

This can also be folded into a single call to tidyr::extract

tidyr::extract(df, name, into = c("Extracted", "New_Name"), 
         regex = "([A-Z]\\d\\.?\\d?\\.)(.*)", remove = FALSE)
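
Note that in the output above New_Name keeps the space that followed the prefix, and that the pattern only allows a single digit on each side of the inner dot. The following is a hedged variation, not part of the original answer: the anchored pattern with `\\d+` and a repeated `(\\.\\d+)*` group is an assumption about what other prefixes might look like, and str_trim is used to drop the leading space.

```r
library(stringr)

# Hypothetical data frame matching the question's example
df <- data.frame(name = c("D1. Hi my name", "A3.3. Hello this is"),
                 Amount = c(1, 4),
                 stringsAsFactors = FALSE)

# Assumed prefix shape: one capital letter, digits, optional ".digits" parts, final dot
pattern <- "^[A-Z]\\d+(\\.\\d+)*\\."

# str_extract returns the whole match, not just the inner group
df$Extracted <- str_extract(df$name, pattern)

# Remove the prefix, then trim the space it leaves behind
df$New_Name <- str_trim(str_remove(df$name, pattern))

print(df)
```

Passing the regex itself to str_remove, rather than the Extracted column, avoids the extracted text being reinterpreted as a pattern, where each dot would match any character.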

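Answer 2 (score: 0)

The first answer above may contain an error. I could not reproduce it in Jupyter Lab without first converting the data to a tibble.

The original data provided was: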

> data.frame("name" = c("D1. Hi my name", "A3.3. Hello this is"),
> "Amount" = c(1, 4))

The answer above shows:

> df %>%   mutate(Extracted = str_extract(name, "[A-Z]\\d\\.?\\d?\\."), 
>          New_Name = str_remove(name, "[A-Z]\\d\\.?\\d?\\."))

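However, the mutate call with the regex as shown here raises an error and does not give the requested output unless df is first converted to a tibble.

A solution that can be reproduced in Jupyter and gives the desired output is as follows: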

> df <- tibble("name" = c("D1. Hi my name", "A3.3. Hello this is"),
> "Amount" = c(1, 4))

With the tibble in place, mutate and the regex run and give the requested output.

> A tibble: 2 × 4
> name                 Amount     Extracted   New_Name
> <chr>                <dbl>      <chr>       <chr>
> D1. Hi my name       1          D1.         Hi my name
> A3.3. Hello this is  4          A3.3.       Hello this is
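
If the data already exists as a plain data frame, it can be converted rather than retyped. The sketch below is an illustration only: the name df_tbl is hypothetical, as_tibble comes from the tibble package, and stringsAsFactors = FALSE is set explicitly as a precaution. The answer does not say what error was raised, so this is not a diagnosis of its cause.

```r
library(tibble)
library(dplyr)
library(stringr)

# The question's data as a plain data frame
df <- data.frame(name = c("D1. Hi my name", "A3.3. Hello this is"),
                 Amount = c(1, 4),
                 stringsAsFactors = FALSE)  # precaution; default differs before R 4.0

# Convert the existing data frame to a tibble (df_tbl is a hypothetical name)
df_tbl <- as_tibble(df)

pattern <- "[A-Z]\\d\\.?\\d?\\."

result <- df_tbl %>%
  mutate(Extracted = str_extract(name, pattern),
         New_Name  = str_remove(name, pattern))

print(result)
```

As in the quoted answer, New_Name here still begins with the space that followed the prefix.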