Question

Here is the result I expect:

> title = "La La Land (2016/I)"
[1]"(2016" #result
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
[1]"(2013" #result
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
[1]"(2015" #result

=============================================== ===================

Here is what I get by applying sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1"):

> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(1500-1800) (#1.1)"  #result. However, I expected it to be "(2013)"
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2016/I)" #result as I expect
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"

Here is what I get by applying sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1"):

> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "La La Land (2016/I)" #result. However, I expect it to be "(2016)"
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6" 
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2013)" #result as I expect

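I looked at the documentation for sub, which says "sub performs replacement of the first match". In this case, the first match should be (2013).

In short, I am trying to write a sub() command that returns the first occurrence of a year in a string.

I suspect something is wrong with my code, but I cannot find it. I would appreciate any help.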

=============================================== ===================

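In fact, my ultimate goal is to extract the year of every movie. However, I do not know how to do it in one step. So I decided to first find the year in the form (dddd, and then use sub(pattern="\\((\\d{4}).*", a, replacement="\\1") to get the bare number of the year.

For example: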

> a= "(2015"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
> a= "(2015)"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"

================= Update 05/29/2017 22:51 PM ============== =========

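The str_extract approach from akrun's answer works well with my dataset.

However, sub() does not work for all of the data. Below is what I did, but my code does not work for all 500 records. If anyone could point out the mistake in my code, I would really appreciate it. I cannot figure it out myself. Thank you very much.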

> t1
[1] "Man Who Fell to Earth (Remix) (2010) (TV)"
> t2
[1] "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
> title = c(t1,t2)
> x=gsub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
> x
[1] "(2010) (TV)" "(2008)"     
> sub(pattern="\\((.*)\\).*", x, replacement="\\1")
[1] "2010) (TV" "2008"

However, my goal is to get 2010 and 2008. My code works for t2 but fails for t1.

Answer 1

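We can match 0 or more characters that are not a ( ([^(]*) from the start (^) of the string, followed by a ( and 4 digits that we capture as a group ((\\([0-9]{4})), followed by any other characters (.*), and replace the whole match with the backreference (\\1) to the captured group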

sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
#[1] "(2016" "(2013" "(2015"
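
For context on why the attempts in the question return the last year rather than the first, here is a small sketch using the third title from the question. The variable names (greedy, m, last_year, first_year) are only for illustration. The "first match" in the documentation refers to the whole pattern: because the pattern begins with .*, that first match starts at character 1 and spans the entire string, and inside it the greedy leading .* leaves only the last (dddd) for the capture group.

```r
title <- "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
greedy <- ".*(\\(\\d{4}\\)).*"

# The pattern matches once, starting at character 1 and covering the
# whole string, so the "first match" is the entire title.
m <- regexpr(greedy, title)
match_start  <- as.integer(m)
match_length <- attr(m, "match.length")

# Inside that single match the leading .* is greedy: it consumes as much
# as it can, so only the last "(dddd)" is left for the capture group.
last_year <- sub(greedy, "\\1", title)

# Replacing the leading .* with ^[^(]* stops the prefix at the first "(".
first_year <- sub("^[^(]*(\\(\\d{4}\\)).*", "\\1", title)

print(c(last_year, first_year))
```

So sub() is replacing the first match as documented; it is the capture group inside that match that lands on the last year.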

If we need to remove the (, then capture only the digits that follow the \\( as a group

sub("^[^(]*\\(([0-9]{4}).*", "\\1", title)
#[1] "2016" "2013" "2015"
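
One caveat with the two sub() patterns above, sketched with t1 and t2 from the update in the question: ^[^(]* stops at the first (, so when the first parenthesis does not hold a year (as in "(Remix)" or "(abreviado)") the pattern does not match at all and sub() returns the input unchanged. A possible alternative, assuming perl = TRUE is acceptable, is a lazy .*? prefix that expands only until it reaches a ( followed by four digits. The names title2, unchanged and lazy_years are illustrative.

```r
t1 <- "Man Who Fell to Earth (Remix) (2010) (TV)"
t2 <- "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
title2 <- c(t1, t2)

# The prefix stops at "(Remix" / "(abreviado", which is not followed by
# four digits, so there is no match and the strings come back unchanged.
unchanged <- sub("^[^(]*\\(([0-9]{4}).*", "\\1", title2)

# A lazy prefix grows one character at a time until "(" plus four digits
# can match, so the capture group holds the first year in each title.
lazy_years <- sub("^.*?\\(([0-9]{4}).*", "\\1", title2, perl = TRUE)

print(lazy_years)
```

Like any sub() approach, this still returns the original string for a title that contains no year in parentheses.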

Or with str_extract, where we use a regex lookaround to extract the 4 digits that follow the (

library(stringr)
str_extract(title, "(?<=\\()[0-9]{4}")
#[1] "2016" "2013" "2015"

Or with regmatches/regexpr

regmatches(title, regexpr("(?<=\\()([0-9]{4})", title, perl = TRUE))
#[1] "2016" "2013" "2015"
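
When running this over many records, note that regmatches() with regexpr() drops the elements that have no match, so the result can be shorter than the input. Below is a minimal sketch of a wrapper that keeps one result per title and uses NA where no year is found. The function name extract_first_year is hypothetical, and "No year here" is a made-up title added only to show the no-match case.

```r
# Hypothetical helper: returns the first 4-digit run that directly follows
# a "(", or NA when a title has none. Output length equals input length.
extract_first_year <- function(x) {
  m <- regexpr("(?<=\\()[0-9]{4}", x, perl = TRUE)
  hit <- !is.na(m) & m != -1L          # TRUE where a match was found
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches() returns matches only
  out
}

titles <- c("La La Land (2016/I)",
            "No year here",            # made-up title with no year
            "Man Who Fell to Earth (Remix) (2010) (TV)")

first_years <- extract_first_year(titles)
n_plain <- length(regmatches(titles, regexpr("(?<=\\()[0-9]{4}", titles, perl = TRUE)))

print(first_years)
```

Keeping the result aligned with the input makes it safe to assign it back as a column next to the titles.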

Data

title <- c("La La Land (2016/I)", 
 "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_", 
"dfajfj(2015)asdfjuwer f(2017)fa.erewr6")

Pattern matching with sub() fails to capture and replace the first occurrence
