I have just started learning regular expressions and I am stuck on a problem. I was given a dataset containing movie award information.
**Award**
Won 2 Oscars. Another 7 wins & 37 nominations.
6 wins& 30 nominations
5 wins
Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.
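I want to extract the numbers before "wins" and "nominations" and add a column for each. For example, for the first row, the wins column would be 7 and the nominations column would be 37.
The pattern I am using is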
df2$nomination <- gsub(".*win[s]?|[[:punct:]]? | nomination.*", "",df2$Awards)
It does not work well, and I don't know how to write the pattern for "wins". :( Can anyone help?
Thanks a lot!
Answer 0: (score: 2)
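We can use str_extract to get the values with a regular expression. The lookahead (?=\\swin) requires the digits to be followed by whitespace and "win" without including that text in the match.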
library(stringr)
text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
"6 wins& 30 nominations",
"5 wins",
"Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = text)
df$value1 <- str_extract(string = df$text, "\\d+\\b(?=\\swin)")
df$value2 <- str_extract(string = df$text, "\\d+\\b(?=\\snomination)")
> df
text value1 value2
1 Won 2 Oscars. Another 7 wins & 37 nominations. 7 37
2 6 wins& 30 nominations 6 30
3 5 wins 5 <NA>
4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations. 1 3
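Note that str_extract returns character strings, so value1 and value2 above are text rather than numbers. As a minimal sketch of the same lookahead idea in base R, with integer columns as the result, something like the following could be used. The helper name extract_count and the column names wins and nominations are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not part of base R or stringr): return the integer that
# directly precedes `keyword` in each string, or NA when there is no match.
extract_count <- function(x, keyword) {
  # Digits followed by optional whitespace and the keyword (lookahead, so the
  # keyword itself is not part of the match)
  pattern <- paste0("\\d+(?=\\s*", keyword, ")")
  m <- regexpr(pattern, x, perl = TRUE)
  found <- as.vector(m) > 0            # regexpr returns -1 where nothing matched
  out <- rep(NA_integer_, length(x))
  out[found] <- as.integer(regmatches(x, m))  # regmatches drops non-matches
  out
}

awards <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "6 wins& 30 nominations",
            "5 wins",
            "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = awards, stringsAsFactors = FALSE)
df$wins <- extract_count(df$text, "win")
df$nominations <- extract_count(df$text, "nomination")
df[c("wins", "nominations")]
```

Rows without a matching keyword, such as "5 wins" for nominations, come back as NA rather than an empty string.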
Answer 1: (score: 0)
We can extract the numbers into a list and then rbind them after padding with NA, to account for cases where there is only a single element
lst <- regmatches(df2$Award, gregexpr("\\d+(?= \\b(wins?|nominations)\\b)",
df2$Award, perl = TRUE))
df2[c('new1', 'new2')] <- do.call(rbind, lapply(lapply(lst, `length<-`,
max(lengths(lst))), as.numeric))
df2
# Award new1 new2
#1 Won 2 Oscars. Another 7 wins & 37 nominations. 7 37
#2 6 wins& 30 nominations 6 30
#3 5 wins 5 NA
#4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations. 1 3
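One caveat with the approach above: the columns are filled by position, so a row that mentions only nominations would place that number in new1, and the pattern as written does not match the singular "nomination". A minimal sketch that assigns by keyword instead, using a capture group with regexec, might look like this. The helper count_before, the column names, and the last two sample rows are hypothetical and not part of the original data.

```r
# Sample data; the last two rows are hypothetical and only illustrate the
# nominations-only and singular cases.
df2 <- data.frame(
  Award = c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "5 wins",
            "3 nominations",
            "1 win & 1 nomination"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: capture the digits in front of `keyword` (singular or
# plural) and return them as an integer, NA when the keyword is absent.
count_before <- function(x, keyword) {
  pattern <- paste0("(\\d+)\\s*", keyword, "s?\\b")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(full match, captured digits), or character(0) for no match
  vapply(hits,
         function(h) if (length(h) >= 2) as.integer(h[2]) else NA_integer_,
         integer(1))
}

df2$wins <- count_before(df2$Award, "win")
df2$nominations <- count_before(df2$Award, "nomination")
df2
```

Because each column is looked up by its own keyword, a missing "wins" entry stays NA instead of shifting the nominations count into the wrong column.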