Regex pattern - get the number before a specific word (gsub)

Asked: 2017-10-14 04:32:28

Tags: r regex gsub

I have just started learning regular expressions and am stuck on a problem. I have a dataset containing information about film awards.

**Award** 
    Won 2 Oscars. Another 7 wins & 37 nominations.
    6 wins& 30 nominations
    5 wins
    Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.

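I want to extract the numbers before "win" and "nomination" and add two columns, one for each. For example, for the first row the win column would be 7 and the nomination column would be 37.

The pattern I used is: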

df2$nomination <- gsub(".*win[s]?|[[:punct:]]? | nomination.*", "",df2$Awards)

It doesn't work well. I don't know how to write the pattern for "wins". :( Can anyone help?

Thanks a lot!

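2 answers:

Answer 0: (score: 2)

We can use str_extract from the stringr package to get the values with a regular expression. The lookahead `(?=\\swin)` requires the digits to be followed by whitespace and "win" without including that text in the match: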

library(stringr)
text <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
          "6 wins& 30 nominations",
          "5 wins",
          "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")
df <- data.frame(text = text)

df$value1 <- str_extract(string = df$text, "\\d+\\b(?=\\swin)")
df$value2 <- str_extract(string = df$text, "\\d+\\b(?=\\snomination)")

> df
                                                              text value1 value2
1                   Won 2 Oscars. Another 7 wins & 37 nominations.      7     37
2                                           6 wins& 30 nominations      6     30
3                                                           5 wins      5   <NA>
4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.      1      3
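Note that `str_extract` returns character values, which is why the missing entry prints as `<NA>`. As a minimal sketch, assuming integer columns are wanted and stringr is not available, the same lookahead idea can be written in base R. The helper name `extract_before` and the column names `win` and `nomination` are hypothetical, and using `\\s*` instead of `\\s` is an assumption made here to tolerate a missing space before the word.

```r
# Hypothetical helper: return the first number that directly precedes `word`
# (optionally separated by whitespace), or NA when no such number exists.
extract_before <- function(x, word) {
  pattern <- paste0("\\d+(?=\\s*", word, ")")   # lookahead needs perl = TRUE
  m <- regexpr(pattern, x, perl = TRUE)          # -1 where there is no match
  out <- rep(NA_integer_, length(x))
  out[m != -1] <- as.integer(regmatches(x, m))   # regmatches drops non-matches
  out
}

awards <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "6 wins& 30 nominations",
            "5 wins",
            "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")

df <- data.frame(Award = awards, stringsAsFactors = FALSE)
df$win        <- extract_before(df$Award, "win")
df$nomination <- extract_before(df$Award, "nomination")
df[c("win", "nomination")]
```

Because each column is extracted with its own pattern, a row that mentions only one of the two words simply gets `NA` in the other column.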

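Answer 1: (score: 0)

We can extract the numbers into a list, pad each element with NA to cover the cases where there is only one match, and then rbind the results: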

lst <- regmatches(df2$Award, gregexpr("\\d+(?= \\b(wins?|nominations)\\b)", 
               df2$Award, perl = TRUE))
df2[c('new1', 'new2')] <- do.call(rbind, lapply(lapply(lst, `length<-`, 
                             max(lengths(lst))), as.numeric))
df2
#                                                             Award new1 new2
#1                   Won 2 Oscars. Another 7 wins & 37 nominations.    7   37
#2                                           6 wins& 30 nominations    6   30
#3                                                           5 wins    5   NA
#4 Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.    1    3
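The padding step is the least obvious part of this approach, so here is a minimal sketch that spells it out on the same four strings. The names `awards`, `pad`, and `mat` are illustrative assumptions and not part of the original answer.

```r
awards <- c("Won 2 Oscars. Another 7 wins & 37 nominations.",
            "6 wins& 30 nominations",
            "5 wins",
            "Nominated for 1 BAFTA Film Award. Another 1 win & 3 nominations.")

# One character vector per row, holding every number that is followed by
# " win", " wins" or " nominations".
lst <- regmatches(awards,
                  gregexpr("\\d+(?= \\b(wins?|nominations)\\b)", awards,
                           perl = TRUE))

# Assigning a longer length to a vector pads it with NA, so every row ends up
# with the same number of elements before rbind().
n   <- max(lengths(lst))
pad <- function(v) { length(v) <- n; as.numeric(v) }
mat <- do.call(rbind, lapply(lst, pad))
mat
```

One caveat of this design is that values are assigned by position, not by word: a row that contained only a nominations count would have that number placed in the first column. The pattern as written also matches only the plural "nominations", so a singular "1 nomination" would not be picked up.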