Pattern matching with sub() fails to capture and replace the first occurrence

Time: 2017-05-29 09:01:27

Tags: r regex

Here is the result I expect:

> title = "La La Land (2016/I)"
[1] "(2016" #result
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
[1] "(2013" #result
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
[1] "(2015" #result

=============================================== ===================

Here is what I get by applying the code sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1"):
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(1500-1800) (#1.1)"  #result. However, I expected it to be "(2013)"
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2016/I)" #result as I expect
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"

Here is what I get by applying the code sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1"):
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "La La Land (2016/I)" #result. However, I expect it to be "(2016)"
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6" 
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2013)" #result as I expect

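I looked at the documentation for sub, which says that sub performs replacement of the first match. In this case, the first match should be (2013).

In short, I am trying to write a sub() command that returns the first occurrence of a year in a string.

I guess something is wrong with my code, but I cannot find it. I would appreciate any help.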

=============================================== ===================

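In fact, my ultimate goal is to extract the year of every movie. However, I don't know how to do that in one step. So I decided to first find the year in the format (dddd, and then use the code sub(pattern="\\((\\d{4}).*", a, replacement="\\1") to get just the digits of the year.

For example: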

> a= "(2015"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
> a= "(2015)"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"

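================= Update 05/29/2017 22:51 PM =================

The str_extract approach from akrun's answer works well with my dataset.

However, sub() does not work for all of the data. Here is what I did; my code does not work for all 500 records. I would really appreciate it if someone could point out the mistake in my code, because I cannot figure it out myself. Thank you very much.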

> t1
[1] "Man Who Fell to Earth (Remix) (2010) (TV)"
> t2
[1] "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
> title = c(t1,t2)
> x=gsub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
> x
[1] "(2010) (TV)" "(2008)"     
> sub(pattern="\\((.*)\\).*", x, replacement="\\1")
[1] "2010) (TV" "2008"     

However, my goal is to get 2010 and 2008. My code works with t2 but fails on t1.

1 Answer:

Answer 0: (score: 1)

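Starting from the beginning of the string (^), we can match zero or more characters that are not an opening parenthesis ([^(]*), followed by an opening parenthesis and 4 digits, which we capture as a group ((\\([0-9]{4})), followed by any remaining characters (.*). We then replace the whole match with the backreference to the captured group (\\1).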

sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
#[1] "(2016" "(2013" "(2015"
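
As a side note on why the patterns in the question return the last year rather than the first: sub() does replace only the first match, but a leading .* is greedy, so that single match covers the whole string and the capture group ends up on the last "(dddd" it can reach. The sketch below compares the greedy form with a lazy .*? (which needs perl = TRUE) and with the negated character class used above. The variable names greedy, lazy and negated are only illustrative.

```r
# Data copied from the question
title <- c("La La Land (2016/I)",
           "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_",
           "dfajfj(2015)asdfjuwer f(2017)fa.erewr6")

# Greedy: the leading .* consumes as much as possible,
# so the group captures the LAST "(" followed by 4 digits
greedy <- sub(".*(\\([0-9]{4}).*", "\\1", title, perl = TRUE)
greedy
# [1] "(2016" "(1500" "(2017"

# Lazy: .*? consumes as little as possible,
# so the group captures the FIRST "(" followed by 4 digits
lazy <- sub(".*?(\\([0-9]{4}).*", "\\1", title, perl = TRUE)
lazy
# [1] "(2016" "(2013" "(2015"

# Negated class: [^(]* cannot run past the first "(" at all
negated <- sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
negated
# [1] "(2016" "(2013" "(2015"
```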

If we need to remove the opening parenthesis, match \\( outside the group and capture only the digits that follow it

sub("^[^(]*\\(([0-9]{4}).*", "\\1", title)
#[1] "2016" "2013" "2015"

Or, with str_extract, we can use a regex lookaround to extract the 4 digits that follow an opening parenthesis
library(stringr)
str_extract(title, "(?<=\\()[0-9]{4}")
#[1] "2016" "2013" "2015"

Or with regmatches/regexpr

regmatches(title, regexpr("(?<=\\()([0-9]{4})", title, perl = TRUE))
#[1] "2016" "2013" "2015"
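
For titles like t1 in the question's update, where an earlier parenthesis such as (Remix) comes before the year, the negated-class sub() patterns above appear not to match, and sub() then returns the input unchanged. A minimal sketch, using t1 from the update plus one title from the original data, suggests that the lookbehind extraction or a lazy .*? with perl = TRUE copes with this case. The names x, unchanged, years_sub and years_rm are only illustrative.

```r
# t1 from the question's update, plus one title from the original data
t1 <- "Man Who Fell to Earth (Remix) (2010) (TV)"
x <- c(t1, "La La Land (2016/I)")

# Negated class: the first "(" is followed by "Remix", not 4 digits,
# so there is no match for t1 and sub() returns it unchanged
unchanged <- sub("^[^(]*\\(([0-9]{4}).*", "\\1", x)
unchanged
# [1] "Man Who Fell to Earth (Remix) (2010) (TV)" "2016"

# Lazy quantifier: skip as little as possible until "(" plus 4 digits
years_sub <- sub("^.*?\\(([0-9]{4}).*$", "\\1", x, perl = TRUE)
years_sub
# [1] "2010" "2016"

# Lookbehind: extract the first 4 digits directly preceded by "("
years_rm <- regmatches(x, regexpr("(?<=\\()[0-9]{4}", x, perl = TRUE))
years_rm
# [1] "2010" "2016"
```

One difference to keep in mind: regmatches() with regexpr() drops elements that have no match, so the result can be shorter than the input, whereas sub() always returns a vector of the same length and leaves non-matching strings as they are.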

data

title <- c("La La Land (2016/I)", 
 "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_", 
"dfajfj(2015)asdfjuwer f(2017)fa.erewr6")