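Text-mining newbie here, trying to extract a variable number of characters and update a column. I have tried str_extract, but I can't seem to get the regex syntax right. Can someone show me how? Thanks!
Reproducible data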
df <- data.frame("name" = c("D1. Hi my name", "A3.3. Hello this is"), "Amount" = c(1, 4))
                 name Amount
1      D1. Hi my name      1
2 A3.3. Hello this is      4
Expected output
                 name Amount      New Name Extracted
1      D1. Hi my name      1    Hi my name       D1.
2 A3.3. Hello this is      4 Hello this is     A3.3.
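Answer 0 (score: 2)
We can use extract from tidyr. Here, the pattern captures one or more non-whitespace characters (\\S+) as the first group, matches the single space that follows, and captures the rest of the string as the second group.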
library(tidyverse)
df %>%
  extract(name, into = c("Extracted", "NewName"), "^(\\S+) (.*)",
          remove = FALSE) %>%
  # keep the original columns first, then the two new ones
  select(names(df), NewName, Extracted)
#                 name Amount       NewName Extracted
#1      D1. Hi my name      1    Hi my name       D1.
#2 A3.3. Hello this is      4 Hello this is     A3.3.
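To see what each capture group in this pattern picks up before handing it to extract, the match can be inspected on its own. The following is a minimal sketch in base R; the names x, groups, extracted, and new_name are illustrative and not part of the answer above.

```r
# Sample strings from the question
x <- c("D1. Hi my name", "A3.3. Hello this is")

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into the matched text.
groups <- regmatches(x, regexec("^(\\S+) (.*)", x))

# Each list element holds: full match, group 1 (the code), group 2 (the rest)
extracted <- vapply(groups, `[`, character(1), 2)
new_name  <- vapply(groups, `[`, character(1), 3)

print(extracted)
print(new_name)
```

Here extracted holds the leading codes and new_name the remaining text, which are the two columns extract creates.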
Or, in base R, we can use sub to turn the first whitespace character into a delimiter and then parse the result with read.csv:
cbind(df, read.csv(text = sub("\\s", ",", df$name),
                   header = FALSE, col.names = c("Extracted", "NewName")))
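The delimiter approach relies on the remaining text containing no commas of its own; otherwise the line may no longer parse into exactly two fields. As a hedged alternative sketch (not the answerer's method), the same two columns can be built with two sub calls and no intermediate delimiter. The data frame is rebuilt here so the snippet runs on its own, and stringsAsFactors = FALSE is an added assumption to keep name a character column on older R versions.

```r
df <- data.frame(name = c("D1. Hi my name", "A3.3. Hello this is"),
                 Amount = c(1, 4), stringsAsFactors = FALSE)

# Keep only the leading run of non-whitespace characters
df$Extracted <- sub("^(\\S+)\\s.*$", "\\1", df$name)

# Drop that leading run and the single whitespace character after it
df$NewName <- sub("^\\S+\\s", "", df$name)

print(df)
```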
Answer 1 (score: 1)
Based on the example shown, we can extract a letter followed by a number to get Extracted, and remove that same part to get New_Name.
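library(dplyr)
library(stringr)
df %>%
  # capital letter, digit, optional ".digit", then a closing "."
  mutate(Extracted = str_extract(name, "[A-Z]\\d\\.?\\d?\\."),
         # Extracted is reused as a regex here, so its dots match any character
         New_Name = str_remove(name, Extracted))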
#                 name Amount Extracted       New_Name
#1      D1. Hi my name      1       D1.     Hi my name
#2 A3.3. Hello this is      4     A3.3.  Hello this is
The same pattern can also be used with tidyr::extract:
tidyr::extract(df, name, into = c("Extracted", "New_Name"),
regex = "([A-Z]\\d\\.?\\d?\\.)(.*)", remove = FALSE)
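Both variants leave the space that follows the code at the start of New_Name. If that is unwanted, one option is to trim it afterwards. A minimal sketch, assuming stringr is installed; the names pattern, extracted, and new_name are illustrative.

```r
library(stringr)

name <- c("D1. Hi my name", "A3.3. Hello this is")

# Same pattern as in the answer: letter, digit, optional ".digit", final "."
pattern <- "[A-Z]\\d\\.?\\d?\\."

extracted <- str_extract(name, pattern)

# Remove the code, then strip the whitespace left at the start
new_name <- str_trim(str_remove(name, pattern))

print(extracted)
print(new_name)
```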
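Answer 2 (score: 0)
The first answer above may contain an error. I could not reproduce it in Jupyter Lab without first converting the data to a tibble.
The original data provided is: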
> data.frame("name" = c("D1. Hi my name", "A3.3. Hello this is"),
> "Amount" = c(1, 4))
The answer above shows:
> df %>% mutate(Extracted = str_extract(name, "[A-Z]\\d\\.?\\d?\\."),
> New_Name = str_remove(name, "[A-Z]\\d\\.?\\d?\\."))
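However, the mutate call with the regular expression (as shown here) raises an error and does not give the requested output unless df is first converted to a tibble.
A solution that can be reproduced in Jupyter and gives the desired output is as follows: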
> df <- tibble("name" = c("D1. Hi my name", "A3.3. Hello this is"),
>              "Amount" = c(1, 4))
Once the data is a tibble, the mutate call and the regex run and give the requested output.
> A tibble: 2 × 4
> name                 Amount  Extracted  New_Name
> <chr>                <dbl>   <chr>      <chr>
> D1. Hi my name       1       D1.        Hi my name
> A3.3. Hello this is  4       A3.3.      Hello this is
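For anyone hitting the same error, one thing worth ruling out (offered as a guess, not a confirmed diagnosis) is that the data.frame(...) call was never assigned to df. In that case df still names the F-distribution density function from the stats package, and piping it into mutate fails. The sketch below assigns a plain data frame first; stringsAsFactors = FALSE is an added assumption so that name is a character column on older R versions too, and the names pattern and result are illustrative.

```r
library(dplyr)
library(stringr)

# Until a data frame is assigned to it, `df` refers to stats::df,
# the density function of the F distribution, not to any data.
print(is.function(stats::df))

# Assign the question's data to `df` as an ordinary data frame (no tibble)
df <- data.frame(name = c("D1. Hi my name", "A3.3. Hello this is"),
                 Amount = c(1, 4), stringsAsFactors = FALSE)

pattern <- "[A-Z]\\d\\.?\\d?\\."

result <- df %>%
  mutate(Extracted = str_extract(name, pattern),
         New_Name = str_remove(name, pattern))

print(result)
```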