Here is the result I expect:
> title = "La La Land (2016/I)"
[1]"(2016" #result
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
[1]"(2013" #result
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
[1]"(2015" #result
=============================================== ===================
Here is what I get by applying the code sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1"):
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(1500-1800) (#1.1)" #result. However, I expected it to be "(2013)"
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2016/I)" #result as I expect
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1]"(2017)" # result. However, I expect it to be "(2015)"
Here is what I get by applying the code sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1"):
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "La La Land (2016/I)" #result. However, I expect it to be "(2016)"
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2013)" #result as I expect
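I checked the documentation for sub, which says that sub replaces only the first match. In this case, the first match should be (2013).
In short, I am trying to write a sub() command that returns the first occurrence of a year in the string.
I suspect something is wrong with my code, but I cannot find it. I would appreciate any help.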
=============================================== ===================
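In fact, my ultimate goal is to extract the year of every movie. However, I don't know how to do that in one step. So I decided to first find the year in the format (dddd and then use the code sub(pattern="\\((\\d{4}).*", a, replacement="\\1") to get the bare digits of the year.
For example: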
> a= "(2015"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
> a= "(2015)"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
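================= Update 05/29/2017 22:51 PM =======================
The str_extract approach from akrun's answer works well with my dataset.
However, sub() does not work for all of the data. Below is what I did; my code does not work for all 500 records. If anyone could point out the error in my code, I would really appreciate it. I can't figure it out myself. Thank you very much.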
> t1
[1] "Man Who Fell to Earth (Remix) (2010) (TV)"
> t2
[1] "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
> title = c(t1,t2)
> x=gsub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
> x
[1] "(2010) (TV)" "(2008)"
> sub(pattern="\\((.*)\\).*", x, replacement="\\1")
[1] "2010) (TV" "2008"
However, my goal is to get 2010 and 2008. My code works with t2, but not with t1.
Answer 0 (score: 1)
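We can match zero or more characters that are not a ( (the pattern [^(]*) from the start of the string (^), followed by a ( and four digits, which we capture as a group ((\\([0-9]{4})), followed by any remaining characters (.*), and replace the whole match with the backreference to the captured group (\\1):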
sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
#[1] "(2016" "(2013" "(2015"
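To illustrate why the patterns in the question return the last year rather than the first, here is a minimal sketch comparing a leading greedy .* with the negated character class. The variable names greedy and anchored are only for illustration, and the pattern in greedy is a simplified version of the question's.

```r
title <- c("La La Land (2016/I)",
           "dfajfj(2015)asdfjuwer f(2017)fa.erewr6")

# A leading greedy .* consumes as much as it can, so the capture group
# ends up on the LAST "(dddd" in the string.
greedy <- sub(".*(\\([0-9]{4}).*", "\\1", title)
greedy
#[1] "(2016" "(2017"

# ^[^(]* cannot run past the first "(", so the capture group is tied to it.
anchored <- sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
anchored
#[1] "(2016" "(2015"
```

sub does replace only the first match, but here the single match spans the whole string, and greediness decides which year falls inside the capture group.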
If we need to remove the (, then we capture only the digits that follow \\( as a group:
sub("^[^(]*\\(([0-9]{4}).*", "\\1", title)
#[1] "2016" "2013" "2015"
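Note that ^[^(]* stops at the first (, so a title such as t1 from the question's update, where (Remix) comes before the year, does not match and sub returns it unchanged. One possible base R workaround, sketched here under the assumption that the year is the first ( followed by four digits, is a lazy .*? with perl = TRUE. The helper name extract_year is hypothetical, and t2 is an ASCII-simplified version of the title in the question.

```r
# Hypothetical helper: returns the first 4-digit run that directly follows
# a "(", or NA when the title contains no such run.
extract_year <- function(x) {
  pattern <- "^.*?\\(([0-9]{4}).*$"      # lazy .*? stops at the first "(dddd"
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

t1 <- "Man Who Fell to Earth (Remix) (2010) (TV)"
t2 <- "Manual practico del amigo imaginario (abreviado) (2008)"  # simplified
years <- extract_year(c(t1, t2, "No year here"))
years
#[1] "2010" "2008" NA
```

Returning NA for non-matching titles keeps the result the same length as the input, which makes it easier to spot the records that fail.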
Or with str_extract, where we use a regex lookaround to extract the four digits that follow a (:
library(stringr)
str_extract(title, "(?<=\\()[0-9]{4}")
#[1] "2016" "2013" "2015"
Or with regmatches/regexpr:
regmatches(title, regexpr("(?<=\\()([0-9]{4})", title, perl = TRUE))
#[1] "2016" "2013" "2015"
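One caveat with regmatches and regexpr is that elements without a match are dropped, so the result can be shorter than the input. A minimal sketch of keeping the result aligned with the input follows; the vector x and the "No year here" entry are made up for illustration.

```r
x <- c("La La Land (2016/I)",
       "No year here",
       "dfajfj(2015)asdfjuwer f(2017)fa.erewr6")

# regexpr returns -1 for elements with no match
m <- regexpr("(?<=\\()[0-9]{4}", x, perl = TRUE)

# Fill matches into a vector of NAs so positions line up with x
year <- rep(NA_character_, length(x))
year[m != -1] <- regmatches(x, m)
year
#[1] "2016" NA     "2015"
```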
# Data used in the examples above
title <- c("La La Land (2016/I)",
"_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_",
"dfajfj(2015)asdfjuwer f(2017)fa.erewr6")