R rvest: filtering the URLs returned by a query

Time: 2017-06-22 14:32:31

Tags: r rvest

I'm using rvest to scrape some data. When I print my URLs variable, I get:

[32] "soccerstats.com/matches.asp?matchday=6"                                     

[33] "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017" 

[34] "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017"

[35] "soccerstats.com/pmatch.asp?league=argentina&matchid=425&t1=11&t2=10&ly=2017"

The dataset contains many URLs, but I'm only interested in the ones that contain:

soccerstats.com/pmatch.asp?league=

I tried to filter them like this:

oversdf <- data.frame(URLs=URLs)

rownames(oversdf) # This returns 1,2,3,4 etc as expected

grep("^soccerstats.com/pmatch.asp?league",rownames(oversdf)) # This then doesn't return any results

Any ideas what I'm doing wrong? I just want to return all the URLs that contain a specific string.

Cheers

library(rvest)

URL <- "http://www.soccerstats.com/matches.asp" #Feed page

WS <- read_html(URL) # Reads the webpage into the WS variable

URLs <- WS %>% html_nodes("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Get the CSS nodes & extract the URLs

URLs <- paste0("http://www.soccerstats.com/",URLs)

oversdf <- data.frame(URLs=URLs)

rownames(oversdf) #returns a vector of row names in the overs data.frame:

grep("^pmatch.asp?league",oversdf$URLs)
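A possible explanation, offered as a sketch rather than a confirmed fix: the first attempt searches rownames(oversdf), which holds "1", "2", "3" and so on, not the URLs themselves. In addition, "?" is a regular-expression quantifier (it makes the preceding "p" optional), so the pattern never matches a literal "?". In the second attempt, the "^" anchor requires the string to start with "pmatch", but every URL has already been prefixed with "http://www.soccerstats.com/". The example below uses sample URLs copied from the printed output above instead of scraping the live page; the names pattern, match_idx, match_urls and regex_idx are illustrative only.

```r
# Sample URLs copied from the printed output in the question
# (assumption: the real vector comes from the rvest scrape)
URLs <- c(
  "soccerstats.com/matches.asp?matchday=6",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=425&t1=11&t2=10&ly=2017"
)
oversdf <- data.frame(URLs = URLs, stringsAsFactors = FALSE)

# fixed = TRUE treats the pattern as a literal string,
# so "." and "?" are not interpreted as regex metacharacters
pattern <- "soccerstats.com/pmatch.asp?league="
match_idx <- grep(pattern, oversdf$URLs, fixed = TRUE)

# Keep only the matching URLs (search the column, not the row names)
match_urls <- oversdf$URLs[match_idx]

# Alternative: a regular expression with the metacharacters escaped
regex_idx <- grep("^soccerstats\\.com/pmatch\\.asp\\?league=", oversdf$URLs)

print(match_urls)
```

Passing value = TRUE to grep() would return the matching strings directly instead of their positions. If the URLs carry the "http://www.soccerstats.com/" prefix, the literal pattern with fixed = TRUE still matches, whereas the anchored regex would need the prefix added after "^".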

0 answers:

No answers yet