I'm using rvest to scrape some data. If I print my URLs variable, I get:
[32] "soccerstats.com/matches.asp?matchday=6"
[33] "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017"
[34] "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017"
[35] "soccerstats.com/pmatch.asp?league=argentina&matchid=425&t1=11&t2=10&ly=2017"
There are many URLs in the data set, but I'm only interested in the ones that contain:
soccerstats.com/pmatch.asp?league=
I tried to filter them like this:
oversdf <- data.frame(URLs=URLs)
rownames(oversdf) # This returns 1,2,3,4 etc as expected
grep("^soccerstats.com/pmatch.asp?league",rownames(oversdf)) # This then doesn't return any results
Any idea what I'm doing wrong? I just want to return all the URLs that contain a specific string.
Cheers
library(rvest)
URL <- "http://www.soccerstats.com/matches.asp" # Feed page
WS <- read_html(URL) # Read the web page into WS
URLs <- WS %>% html_nodes("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Select the CSS nodes and extract the URLs
URLs <- paste0("http://www.soccerstats.com/",URLs)
oversdf <- data.frame(URLs=URLs)
rownames(oversdf) # Returns a vector of row names in the oversdf data.frame
grep("^pmatch.asp?league",oversdf$URLs)
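
For what it's worth, three things look likely to cause the empty result in the attempts above: grep() is searching the row names ("1", "2", ...) rather than the URLs column; the ^ anchor requires a match at the very start of each string; and ? is a regex metacharacter (it makes the preceding p optional), so the pattern never matches a literal "asp?league". A minimal sketch of one possible fix follows, assuming a plain substring match is all that is needed. The three sample URLs are copied from the printed output in the question so the snippet runs without scraping, and the names pattern, hits and matchURLs are made up for illustration.

```r
# Sample URLs taken from the printed output in the question (stand-in for the scraped vector)
URLs <- c(
  "soccerstats.com/matches.asp?matchday=6",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=422&t1=5&t2=14&ly=2017",
  "soccerstats.com/pmatch.asp?league=argentina&matchid=432&t1=23&t2=26&ly=2017"
)
oversdf <- data.frame(URLs = URLs, stringsAsFactors = FALSE)

# Search the URLs column, not the row names.
# fixed = TRUE treats the pattern as a literal string, so "." and "?" need no escaping.
pattern <- "soccerstats.com/pmatch.asp?league="
hits <- grepl(pattern, oversdf$URLs, fixed = TRUE)

# Keep only the matching URLs
matchURLs <- oversdf$URLs[hits]
print(matchURLs)

# Equivalent regex form, with the metacharacters escaped
hits_regex <- grepl("pmatch\\.asp\\?league=", oversdf$URLs)
```

Because the literal pattern has no anchor, it should also match if the URLs carry the "http://www." prefix added by paste0(). If row indices are preferred over a logical vector, grep() accepts the same fixed = TRUE argument.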