I am trying to retrieve data from a website. My code is as follows:
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'http://ecal.forexpros.com/e_cal.php?duration=weekly'
html = urlopen(url)
soup = BeautifulSoup(html)

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr'
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}):
    # finds desired data by looking up <td> elements with class names
    event = tr.find('td', {'class': 'ec_td_event'}).text
    currency = tr.find('td', {'class': 'ec_td_currency'}).text
    actual = tr.find('td', {'class': 'ec_td_actual'}).text
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text
    previous = tr.find('td', {'class': 'ec_td_previous'}).text
    time = tr.find('td', {'class': 'ec_td_time'}).text
    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')
    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, importance, currency, event, actual, forecast, previous))
The first few records of the result set look like this:
05:00 High EUR CPI (YoY) 1.3% 1.3% 1.3%
10:00 High USD Pending Home Sales (MoM) 1.5% 0.7% -0.7%
21:45 High CNY Caixin Manufacturing PMI 51.1 50.4 50.4
00:30 High AUD RBA Interest Rate Decision 1.50% 1.50% 1.50%
00:30 High AUD RBA Rate Statement
03:55 High EUR German Manufacturing PMI 58.1 58.3 58.3
03:55 High EUR German Unemployment Change -9K -5K 6K
I am now trying to retrieve similar data from the following website:
https://www.fxstreet.com/economic-calendar
To do so, I modified the code above as follows:
import re
from urllib2 import urlopen
from bs4 import BeautifulSoup

# gets a file-like object using urllib2.urlopen
url = 'https://www.fxstreet.com/economic-calendar'
html = urlopen(url)
soup = BeautifulSoup(html)

for tr in soup.find_all('tr', {'class': re.compile('fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow ')}):
    # finds desired data by looking up <div> elements with class names
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': ' fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    # importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt')
    # the returned strings are unicode, so to print them we need to use a unicode string
    if importance == 'High':
        print(u'\t{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, currency, event, actual, forecast, previous))
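This code does not return any results (probably because I am referencing the wrong tags and/or classes). Can anyone see where my mistake is?
Thanks!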
Answer 0 (score: 1)
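You should use selenium + Chromedriver / PhantomJS to parse content that is created dynamically by JavaScript; urllib2 has no way of handling that. I don't think using a regex makes much sense here. You can use the lxml parser, which allows multiple classes, and pass them in a list. Here is an example using the tools already mentioned: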
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.fxstreet.com/economic-calendar'

# let a real browser run the page's JavaScript, then grab the rendered HTML
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')

# a list of classes matches any <tr> that carries at least one of them
for tr in soup.find_all('tr', {'class': ['fxst-tr-event', 'fxst-oddRow', 'fxit-eventrow', 'fxst-evenRow', 'fxs_cal_nextEvent']}):
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': 'fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    print(time, currency, event, actual, forecast, previous)
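Note that lxml is a library in its own right. You can also handle multiple classes with the standard html.parser, but in my opinion that is less intuitive. This code prints: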
14:00
CAD 14:00 None 59.2
61.6
14:00
CAD 14:00 52.9
63.9
17:00
USD 17:00 765
...
...
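I did not change any of the variable names because I am not sure what you want each of them to hold, so you will probably want to adjust them further and format the output.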
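To make the class matching easier to experiment with, here is a minimal offline sketch. It assumes a small hand-written HTML snippet (SAMPLE_HTML, with placeholder event names) that only loosely imitates the class names from the question; the real page structure may differ. The helpers text_of and parse_rows are hypothetical names of my own, not part of any library. It shows that a list passed as class_ matches a row carrying any of the listed classes, and that a CSS selector can pick a div by a single class even when the element carries several.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names in the question.
# The event names are placeholders, not real calendar data.
SAMPLE_HTML = """
<table>
  <tr class="fxst-tr-header"><td>Time</td></tr>
  <tr class="fxst-tr-event fxst-oddRow fxit-eventrow">
    <td><div class="fxit-eventInfo-time fxs_event_time">14:00</div></td>
    <td><div class="fxit-event-name">Event A</div></td>
    <td><div class="fxit-actual">59.2</div></td>
  </tr>
  <tr class="fxst-tr-event fxst-evenRow fxit-eventrow">
    <td><div class="fxit-eventInfo-time fxs_event_time">17:00</div></td>
    <td><div class="fxit-event-name">Event B</div></td>
    <td><div class="fxit-actual">765</div></td>
  </tr>
</table>
"""


def text_of(row, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    node = row.select_one(selector)
    return node.get_text(strip=True) if node else ''


def parse_rows(html):
    """Collect (time, name, actual) tuples from the event rows in html."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # A list passed as class_ matches rows that carry ANY of the listed classes,
    # so the header row (fxst-tr-header) is skipped.
    for tr in soup.find_all('tr', class_=['fxst-oddRow', 'fxst-evenRow']):
        rows.append((
            text_of(tr, 'div.fxs_event_time'),   # one class is enough to select
            text_of(tr, 'div.fxit-event-name'),
            text_of(tr, 'div.fxit-actual'),
        ))
    return rows


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Returning an empty string from text_of instead of calling .text directly avoids an AttributeError when a row lacks one of the cells. To use this against the live page, you would pass driver.page_source to parse_rows instead of SAMPLE_HTML, after checking the actual class names in the rendered HTML.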