I want to scrape URLs from the following page:
http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5
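There are 180 URLs to collect from this page (each is a link to a speech given in Parliament), but I run into problems whenever there are more than 100 URLs to scrape, because the additional speeches can only be reached by clicking the "See more" box at the bottom of the page. I have tried to work out how to reveal the additional links, which I think are hidden by the "getMore" function, but with no luck. Apologies in advance for the naivety...
My current code is as follows: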
mep.speech.list.url <-"http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
speech.list.data<-try(readLines(mep.speech.list.url),silent=TRUE)
mep.speech.list<-speech.list.data
mep.speech.lines<-grep("href",mep.speech.list)
mep.speech.list<-mep.speech.list[mep.speech.lines]
mep.speech.lines<-grep("target",mep.speech.list)
mep.speech.list<-mep.speech.list[mep.speech.lines]
mep.speech.list<-mep.speech.list[-length(mep.speech.list)]
mep.speech.list.end<-regexpr("target",mep.speech.list)
mep.speech.list<-substr(mep.speech.list,1, mep.speech.list.end)
mep.speech.list<-gsub("\t","",mep.speech.list)
mep.speech.list<-gsub('<a href=\"',"",mep.speech.list)
mep.speech.list<-gsub('\" target',"",mep.speech.list)
mep.speech.list<-gsub('\" targe',"",mep.speech.list)
mep.speech.list<-gsub('\" targ',"",mep.speech.list)
mep.speech.list<-gsub('\" tar',"",mep.speech.list)
mep.speech.list<-gsub('\" ta',"",mep.speech.list)
mep.speech.list<-gsub('\" t',"",mep.speech.list)
mep.speech.list<-mep.speech.list[5:length(mep.speech.list)]
print(mep.speech.list)
Answer 0 (score: 3)
The SEE MORE button runs some JavaScript that makes an AJAX call. You can use Selenium to automate a browser and extract the links:
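library(RSelenium)
appURL <- "http://www.europarl.europa.eu/meps/en/1186/seeall.html?type=CRE&leg=5"
# Start a local Selenium server and open a browser session.
# Note: startServer() was retired in later RSelenium releases; rsDriver() is the usual replacement there.
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appURL)
# Click the "see more" button, then give the AJAX call time to finish.
remDr$findElement("id", "seemore")$clickElement()
Sys.sleep(5)
# Use the page's own jQuery to collect the href of every link in the speech list.
jsScript <- "var hrefs = new Array();
$('#content_left .listcontent a').each(function(){
hrefs.push($(this).attr('href'));
});
return hrefs;"
appHREF <- remDr$executeScript(jsScript)[[1]]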
> head(appHREF)
[1] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040504+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-205"
[2] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-069"
[3] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040422+ITEM-005+DOC+XML+V0//EN&language=en&query=INTERV&detail=4-122"
[4] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040421+ITEM-008+DOC+XML+V0//EN&language=en&query=INTERV&detail=3-207"
[5] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-074"
[6] "http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20040330+ITEM-004+DOC+XML+V0//EN&language=en&query=INTERV&detail=2-099"
>
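The fixed Sys.sleep(5) above works when the AJAX call finishes within five seconds, but it either wastes time or fails on a slow connection. A minimal sketch of a polling alternative follows, assuming a helper named wait_until (hypothetical, not part of RSelenium) that re-checks a condition until it holds or a timeout expires:

```r
# Hypothetical helper (not part of RSelenium): poll a condition function
# until it returns TRUE or the timeout (in seconds) expires.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat an error raised by the condition as "not ready yet".
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in condition for illustration: becomes TRUE on the third check.
calls <- 0
ready <- wait_until(function() {
  calls <<- calls + 1
  calls >= 3
}, timeout = 5, interval = 0.1)
```

With the browser session from the answer, the condition could be something like function() length(remDr$findElements("css selector", "#content_left .listcontent a")) > 100, where the threshold of 100 is taken from the question and would need adjusting for other pages.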