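I need to run a script every day that scrapes the following website. When the script runs, it should scrape that day's calendar (the equivalent of clicking the "Daily" button):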
http://www.fxempire.com/economic-calendar/
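I want to extract all of the data/events for that day, filter them by the relevant currencies (where applicable), and then create some kind of alert or pop-up 10 minutes before each event.
So far I have used the code below to scrape the page, but when I view/print the variable "html" I cannot find the calendar information I need.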
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # Quit the event loop once the page has finished loading
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep a reference to the loaded frame so its HTML can be read later
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://www.fxempire.com/economic-calendar/'
r = Render(url)
html = r.frame.toHtml()
Answer 0 (score: 2)
In my opinion, the best way to scrape data from a web page is BeautifulSoup. Here is a quick script that gets the data you want.
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# Get a file-like object using urllib2.urlopen
url = 'http://ecal.forexpros.com/e_cal.php?duration=daily'
html = urlopen(url)

# BS accepts a lot of different data types, so you don't have to do e.g.
# urlopen(url).read(). It accepts file-like objects, so we'll just send in html
# as a parameter.
soup = BeautifulSoup(html)

# Loop over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr'
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}):
    # Find the event, currency and actual price by looking up <td> elements
    # with class names.
    event = tr.find('td', {'class': 'ec_td_event'}).text
    currency = tr.find('td', {'class': 'ec_td_currency'}).text
    actual = tr.find('td', {'class': 'ec_td_actual'}).text

    # The returned strings are unicode, so to print them we need to use a
    # unicode format string.
    print u'{:3}\t{:6}\t{}'.format(currency, actual, event)
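To get from this script to the alert described in the question, one possible next step is to filter the rows by currency and compute a time 10 minutes before each event. The sketch below is written for Python 3 and uses only the standard library. The name alerts_for, the (time, currency, event) row layout and the 'HH:MM' time format are assumptions for illustration: the script above does not extract the event time, and the page may present times differently.

```python
from datetime import datetime, timedelta

# How long before each event the alert should fire (from the question).
ALERT_LEAD = timedelta(minutes=10)


def alerts_for(rows, day, currencies):
    """Return sorted (alert_time, currency, event) tuples for wanted currencies.

    rows is an iterable of (time_text, currency, event) strings.
    time_text is ASSUMED to look like 'HH:MM'; rows that do not parse
    (for example all-day entries) are skipped.
    """
    wanted = {c.upper() for c in currencies}
    alerts = []
    for time_text, currency, event in rows:
        if currency.strip().upper() not in wanted:
            continue
        try:
            event_time = datetime.strptime(time_text.strip(), '%H:%M').time()
        except ValueError:
            # No usable clock time, so no alert can be scheduled.
            continue
        starts_at = datetime.combine(day, event_time)
        alerts.append((starts_at - ALERT_LEAD, currency.strip(), event.strip()))
    return sorted(alerts)
```

Each returned alert time could then be compared against datetime.now() in a loop, or handed to whatever scheduler or notification tool you prefer.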
To give you some tips on how to solve this kind of problem in the future, I have written down the steps I used to solve this one. I hope it helps.
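1. Open the page in a browser and use Inspect Element.
2. Find the iframe that contains the calendar, then open that URL.
3. Inspect that page as well. The rows that hold the data are all <tr> elements with the class ec_bg1_tr or ec_bg2_tr.
4. BeautifulSoup can find all tr elements with the class ec_bg1_tr using soup.find_all('tr', {'class': 'ec_bg1_tr'}). My first idea was to loop over those elements first, and then loop over the ec_bg2_tr elements.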
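The steps above can be sketched in code. This is only an illustration, written for Python 3 with the standard library's html.parser rather than BeautifulSoup, so it runs without extra packages. The PageScanner class, the scan function, the sample markup and the example.com URL are all made up for the example. It lists the iframe URLs on a page, and it shows that the single pattern ec_bg[12]_tr used in the script matches both row classes in document order, which is why one loop is enough instead of two.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the script: matches both alternating row classes.
ROW_CLASS = re.compile(r'ec_bg[12]_tr')


class PageScanner(HTMLParser):
    """Collect iframe URLs and the classes of matching <tr> rows, in order."""

    def __init__(self):
        super().__init__()
        self.iframe_urls = []
        self.row_classes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'iframe' and attrs.get('src'):
            self.iframe_urls.append(attrs['src'])
        elif tag == 'tr' and ROW_CLASS.search(attrs.get('class') or ''):
            self.row_classes.append(attrs['class'])


def scan(html):
    """Return (iframe_urls, row_classes) found in an HTML string."""
    scanner = PageScanner()
    scanner.feed(html)
    return scanner.iframe_urls, scanner.row_classes


# Made-up markup, for illustration only.
outer = '<div><iframe src="http://example.com/calendar"></iframe></div>'
inner = ('<table><tr class="ec_head_tr"></tr><tr class="ec_bg1_tr"></tr>'
         '<tr class="ec_bg2_tr"></tr><tr class="ec_bg1_tr"></tr></table>')

print(scan(outer)[0])  # iframe URLs found on the outer page
print(scan(inner)[1])  # classes of the data rows, in document order
```

Looping over ec_bg1_tr first and ec_bg2_tr second would visit the same rows, but grouped by class instead of in the order they appear in the calendar.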