R web scraping: querying by date

Time: 2014-05-22 19:40:10

Tags: r

I can scrape a web page that contains news with the following code:

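library(XML)
webpage <- "http://www.tradingeconomics.com/calendar"
# Parse every HTML table on the page into a list of data frames
tables <- readHTMLTable(webpage)
# Number of rows in each parsed table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

# Keep the table named "calendar"
dfcal <- as.data.frame(tables$calendar)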

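For example, how can I extract the news starting from January 2014? I can do this on the web page by changing the button settings, but how do I do it in R?

Is there also a better way to collect economic news from within R? I looked at http://www.rseek.org/ but could not find anything. Thanks for your help.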

 <form method="post" action="/calendar" id="aspnetForm">
<div class="aspNetHidden">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" 


<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
    theForm.__EVENTTARGET.value = eventTarget;
    theForm.__EVENTARGUMENT.value = eventArgument;
    theForm.submit();
}
}
//]]>
</script>

2 answers:

Answer 0 (score: 1)

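I intercepted the request (it is at the bottom of this "answer"). It is a fairly ugly AJAX call: as far as I can tell (I did not read the page's JavaScript), it base64-encodes something called "view state". It also passes a cookie, which may or may not matter. I do not know which of the parameters below are actually required, but you can see all the HTTP request headers and query parameters that were sent. Try using them in a POST, as @agstudy suggested.

Another option is to use RSelenium to drive a browser and scrape the page that way.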
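A minimal sketch of how that POST might be assembled from R, assuming the `httr` and `XML` packages are installed. The helper names `build_calendar_form` and `fetch_calendar` are hypothetical. The field names and values are copied from the intercepted request in this answer; whether the server accepts this exact set of fields has not been verified, and the reply to an `X-MicrosoftAjax: Delta=true` request is probably a partial-update payload rather than a full HTML page, so it may need extra parsing before `readHTMLTable` can use it.

```r
calendar_url <- "http://www.tradingeconomics.com/calendar"

# Build the form body as a named list (no network access needed).
# Field names come from the intercepted request; which are required is unknown.
build_calendar_form <- function(viewstate, start_date, end_date) {
  p <- "ctl00$ContentPlaceHolder1$ctl02$"
  control_fields <- c(startDate = as.character(start_date),
                      endDate = as.character(end_date),
                      DropDownListTimezone = "-300", Country = "top",
                      Category = "", Importance = "", Button2 = "ok")
  names(control_fields) <- paste0(p, names(control_fields))
  c(
    list(
      "ctl00$AjaxScriptManager1$ScriptManager1" =
        paste0(p, "UpdatePanel1|", p, "Button2"),
      "__EVENTTARGET" = "", "__EVENTARGUMENT" = "", "__LASTFOCUS" = "",
      "__VIEWSTATE" = viewstate, "srch-term" = "", "__ASYNCPOST" = "true"
    ),
    as.list(control_fields)
  )
}

# Hypothetical wrapper: fetch the page, read its hidden __VIEWSTATE field,
# then post the form with the AJAX headers seen in the capture.
fetch_calendar <- function(start_date, end_date) {
  page <- httr::GET(calendar_url)
  doc <- XML::htmlParse(httr::content(page, as = "text", encoding = "UTF-8"),
                        asText = TRUE)
  viewstate <- XML::xpathSApply(doc, "//input[@id='__VIEWSTATE']",
                                XML::xmlGetAttr, "value")
  if (length(viewstate) != 1) stop("could not find a single __VIEWSTATE field")
  resp <- httr::POST(
    calendar_url,
    body = build_calendar_form(viewstate[[1]], start_date, end_date),
    encode = "form",
    httr::add_headers("X-Requested-With" = "XMLHttpRequest",
                      "X-MicrosoftAjax" = "Delta=true",
                      Referer = calendar_url)
  )
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Inspect the form that would be sent for January 2014.
form <- build_calendar_form("VIEWSTATE_PLACEHOLDER", "2014-01-01", "2014-01-31")
str(form)
```

Calling `fetch_calendar("2014-01-01", "2014-01-31")` would perform the two requests. Reading `__VIEWSTATE` from a fresh page load, rather than pasting in a captured value, is a design choice made here on the assumption that the value changes between sessions.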


The intercepted HTTP AJAX request follows:

POST /calendar HTTP/1.1
Host: www.tradingeconomics.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
X-MicrosoftAjax: Delta=true
Cache-Control: no-cache
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Referer: http://www.tradingeconomics.com/calendar
Content-Length: 116721
Cookie: ASP.NET_SessionId=k2tc0xaqplgnps01mleehdcw; _ga=GA1.2.721625302.1402653647
Connection: keep-alive
Pragma: no-cache

ctl00%24AjaxScriptManager1%24ScriptManager1=ctl00%24ContentPlaceHolder1%24ctl02%24UpdatePanel1%7Cctl00%24ContentPlaceHolder1%24ctl02%24Button2&__EVENTTARGET=&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE=GIANT_BASE64_STRING_THAT_I_REMOVED&srch-term=&ctl00%24ContentPlaceHolder1%24ctl02%24startDate=2014-01-01&ctl00%24ContentPlaceHolder1%24ctl02%24endDate=2014-01-31&ctl00%24ContentPlaceHolder1%24ctl02%24DropDownListTimezone=-300&ctl00%24ContentPlaceHolder1%24ctl02%24Country=top&ctl00%24ContentPlaceHolder1%24ctl02%24Category=&ctl00%24ContentPlaceHolder1%24ctl02%24Importance=&__ASYNCPOST=true&ctl00%24ContentPlaceHolder1%24ctl02%24Button2=ok
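The request body above is URL-encoded, which makes the field names hard to read. A small base-R sketch for decoding such a body into named values is shown here; `parse_form_body` is a hypothetical helper, and the `body` string is an abbreviated copy of the capture (with the `__VIEWSTATE` placeholder left as it appears above).

```r
# Abbreviated copy of the captured request body.
body <- paste0(
  "__EVENTTARGET=&__VIEWSTATE=GIANT_BASE64_STRING_THAT_I_REMOVED",
  "&ctl00%24ContentPlaceHolder1%24ctl02%24startDate=2014-01-01",
  "&ctl00%24ContentPlaceHolder1%24ctl02%24endDate=2014-01-31",
  "&ctl00%24ContentPlaceHolder1%24ctl02%24Country=top",
  "&__ASYNCPOST=true"
)

# Split a form-encoded body into a named character vector.
# Note: "+" is not converted to a space; the capture contains none.
parse_form_body <- function(body) {
  pairs <- strsplit(body, "&", fixed = TRUE)[[1]]
  # Split each pair at the first "=" only, so values containing "=" survive.
  eq <- regexpr("=", pairs, fixed = TRUE)
  keys <- substr(pairs, 1, eq - 1)
  values <- substr(pairs, eq + 1, nchar(pairs))
  decode <- function(x) {
    vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  }
  stats::setNames(decode(values), decode(keys))
}

params <- parse_form_body(body)
print(names(params))
```

Decoded this way, `%24` becomes `$`, so the date fields show up under names such as `ctl00$ContentPlaceHolder1$ctl02$startDate`.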

Answer 1 (score: 1)

As @hrbrmstr noted, you can use RSelenium and Selenium to drive a browser:

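require(RSelenium)
# start a local Selenium server (startServer() belongs to older RSelenium
# releases; newer releases provide rsDriver() instead)
RSelenium::startServer()
remDr <- remoteDriver()
# open the browser and give it time to start
remDr$open(); Sys.sleep(15)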
remDr$navigate("http://www.tradingeconomics.com/calendar")
# get the DOM element for the custom date
webElem <- remDr$findElement("xpath", "//a[@data-target=\"#datesDiv\"]")
# send a click to the element using javascript
remDr$executeScript("arguments[0].click();", list(webElem))
startDate <- remDr$findElement("id", "startDate")
startDate$clearElement()
startDate$sendKeysToElement(list("2014-01-01"))
endDate <- remDr$findElement("id", "endDate")
endDate$clearElement()
endDate$sendKeysToElement(list("2014-01-31"))

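# submit the date range and give the page time to update
okButton <- remDr$findElement("id", "ctl00_ContentPlaceHolder1_ctl02_Button2")
okButton$clickElement()
Sys.sleep(15)
# grab the rendered HTML of the updated page
wData <- remDr$getPageSource()[[1]]

require(XML)
# parse the tables from the page source instead of from the URL
tables <- readHTMLTable(wData)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
dfcal <- as.data.frame(tables$calendar)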

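The script above opens the web page, clicks the custom date control, enters the relevant dates, and clicks the OK button. It then retrieves the page's HTML source.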
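To collect everything from January 2014 onward, the same steps could be repeated once per month. Below is a base-R sketch that generates the start and end dates for such a loop; `month_ranges` is a hypothetical helper, and it assumes `from` is the first day of a month.

```r
# Build one (start, end) pair per calendar month.
# Assumes `from` is the first day of a month.
month_ranges <- function(from, n_months) {
  bounds <- seq(as.Date(from), by = "month", length.out = n_months + 1)
  data.frame(
    start = format(bounds[-length(bounds)]),
    # the last day of a month is the day before the next month starts
    end = format(bounds[-1] - 1),
    stringsAsFactors = FALSE
  )
}

ranges <- month_ranges("2014-01-01", 3)
print(ranges)
```

Each row's `start` and `end` strings are in the same `YYYY-MM-DD` form that the script types into the `startDate` and `endDate` fields, so a loop over the rows could pass them to `sendKeysToElement` and collect one data frame per month.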