Removing part of a URL string in R

Date: 2017-06-13 08:55:15

Tags: r regex gsub

I have a bunch of URLs of the form:

http://www.imdb.com/title/tt0383574/?ref_=adv_li_tt

I want to keep only the movie code (i.e. 0383574). I tried this:

library(rvest)  # provides read_html(), html_nodes() and html_attrs()

url = "http://www.imdb.com/search/title?genres=action&title_type=feature&sort=moviemeter,asc"

page = read_html(url)

movie.nodes <- html_nodes(page,'.lister-item-header a')
movie.nodes
movie.link <- sapply(html_attrs(movie.nodes),`[[`,'href')
movie.link <- paste0("http://www.imdb.com",movie.link)
movie.id1 <- gsub("http://www.imdb.com/title/tt", "", movie.link)
movie.id <- gsub("/?ref_=adv_li_tt", "", movie.id1)

But calling movie.id returns:

[1]  "0451279/?" "2345759/?" "1790809/?" "1469304/?" "0974015/?" "3896198/?"
[7]  "3371366/?" "3890160/?" "3315342/?" "4425200/?" "2250912/?" "2406566/?"
[13] "1972591/?" "1825683/?" "2091256/?" "3501632/?" "4630562/?" "1386697/?"
[19] "4154756/?" "4116284/?" "2975590/?" "5884234/?" "5013056/?" "1211837/?"
[25] "0120616/?" "2527336/?" "1082807/?" "0325980/?" "1293847/?" "2034800/?"
[31] "2015381/?" "2911666/?" "1648190/?" "4912910/?" "1298650/?" "1477834/?"
[37] "2334871/?" "3748528/?" "2239822/?" "3469046/?" "2461150/?" "3731562/?"
[43] "1431045/?" "0449088/?" "3385516/?" "2226597/?" "0468569/?" "1219827/?"
[49] "0383574/?" "3498820/?"

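How can I remove the /? from the output? Thanks in advance.

3 answers:

Answer 0: (score: 2)

Since the movie ID is the only part of the URL that contains digits, you can remove every character that is not a digit, which leaves you with the IDs: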

> gsub("[^[:digit:]]", "", movie.link)
 [1] "0451279" "2345759" "1790809" "1469304" "0974015" "3896198" "3371366" "3890160" "3315342" "4425200"
[11] "2250912" "2406566" "1972591" "1825683" "2091256" "3501632" "4630562" "1386697" "4154756" "4116284"
[21] "2975590" "5884234" "5013056" "1211837" "0120616" "2527336" "1082807" "0325980" "1293847" "2034800"
[31] "2015381" "2911666" "1648190" "4912910" "1298650" "1477834" "2334871" "3748528" "2239822" "3469046"
[41] "2461150" "3731562" "1431045" "0449088" "3385516" "2226597" "0468569" "1219827" "0383574" "3498820"
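To make the assumption behind this approach explicit, here is a minimal sketch. The first URL is the one from the question; the second, with an extra `&page=2` parameter, is a hypothetical URL invented only to show what happens when digits appear elsewhere in the string.

```r
# Example URL taken from the question
url_ok <- "http://www.imdb.com/title/tt0383574/?ref_=adv_li_tt"

# Hypothetical URL (not from the question) with an extra digit in the query string
url_extra <- "http://www.imdb.com/title/tt0383574/?ref_=adv_li_tt&page=2"

# Remove every character that is not a digit
id_ok    <- gsub("[^[:digit:]]", "", url_ok)
id_extra <- gsub("[^[:digit:]]", "", url_extra)

id_ok     # "0383574"
id_extra  # "03835742" (the trailing 2 comes from page=2)
```

For the URLs shown in the question this is not a problem, because the ID is the only run of digits in each link.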

Answer 1: (score: 1)

I just found out how to do it:

movie.id <- gsub("\\D", "", movie.link)

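This works because \\D matches any character that is not a digit, so gsub replaces every non-digit with an empty string.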
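If the links could ever contain digits outside the ID (an assumption; the links in the question do not), one possible alternative is to capture only the digits that follow `/title/tt`. This is a sketch rather than part of the original answer, and the two links below are written out by hand in the form the question describes.

```r
# Two links in the form shown in the question
movie.link <- c(
  "http://www.imdb.com/title/tt0383574/?ref_=adv_li_tt",
  "http://www.imdb.com/title/tt0468569/?ref_=adv_li_tt"
)

# Keep only the digits captured after "/title/tt"; \\1 refers to the first group
movie.id <- sub(".*/title/tt([0-9]+).*", "\\1", movie.link)

movie.id  # "0383574" "0468569"
```

Note that sub() returns its input unchanged when the pattern does not match, so a link without `/title/tt` followed by digits would come back as the full URL.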

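Answer 2: (score: 1)

gsub takes a regular expression pattern as its first argument. In a regular expression, ? is a special character meaning that the preceding element may occur zero or one times.

So your pattern currently matches ref_=adv_li_tt optionally preceded by a /, and it never matches the literal ?.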

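You need to escape the ? to indicate that you are searching for a literal question mark. In an R string the backslash itself has to be doubled:

gsub("/\\?ref_=adv_li_tt", "", movie.id1)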
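The difference can be checked on a small vector. This is a minimal sketch: the two input strings are written out by hand to mimic movie.id1 from the question, and the `fixed = TRUE` variant is an additional option that treats the pattern as a literal string instead of a regular expression.

```r
# Strings shaped like movie.id1 in the question (after the first gsub call)
movie.id1 <- c("0383574/?ref_=adv_li_tt", "3498820/?ref_=adv_li_tt")

# Unescaped: "/?" means "an optional slash", so the literal "/?" is left behind
unescaped <- gsub("/?ref_=adv_li_tt", "", movie.id1)
unescaped  # "0383574/?" "3498820/?"

# Escaped: "\\?" matches a literal question mark
escaped <- gsub("/\\?ref_=adv_li_tt", "", movie.id1)
escaped    # "0383574" "3498820"

# fixed = TRUE: the whole pattern is matched literally, no escaping needed
literal <- gsub("/?ref_=adv_li_tt", "", movie.id1, fixed = TRUE)
literal    # "0383574" "3498820"
```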