I am trying to extract the price data (high and low) from the 3rd table (Corn). The code returns "None":
import urllib2
from bs4 import BeautifulSoup
import time
import re

start_urls = 4539
nb_quotes = 10

for urls in range(start_urls, start_urls - nb_quotes, -1):
    start_time = time.time()
    # construct the URLs strings
    url = 'http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains'
    # Read the HTML page content
    page = urllib2.urlopen(url)
    # Create a beautifulsoup object
    soup = BeautifulSoup(page)
    # Search the table to be parsed in the whole HTML code
    tables = soup.findAll('table')
    tab = tables[2]  # This is the table to be parsed
    low_tmp = str(tab.findAll('tr')[0].findAll('td')[1].getText())  # Low price
    low = re.sub('[+]', '', low_tmp)
    high_tmp = str(tab.findAll('tr')[0].findAll('td')[2].string)  # High price
    high = re.sub('[+]', '', high_tmp)
    stop_time = time.time()
    print low, '\t', high, '(%0.1f s)' % (stop_time - start_time)
Answer 0 (score: 1)
The data in the tables is filled in on the browser side by the following JavaScript call:
document.write(getQuoteboardHTML(
splitQuote(quotes, 'ZC*1,ZC*2,ZC*3,ZC*4,ZC*5,ZC*6,ZC*7,ZC*8,ZC*9'.split(/,/)),
'shortmonthonly,high,low,last,change'.split(/,/), { nospacers: true }));
BeautifulSoup is an HTML parser; it does not execute JavaScript.
Basically, you need something that executes this JavaScript for you.
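To see why a parser alone comes up empty, here is a minimal sketch in Python 3. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages, and the PAGE string is a hypothetical, simplified stand-in for the real page, not a copy of it. The quotefield_low class that holds the prices only exists after a browser runs the script, so it never shows up in the static markup:

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the real page: the cell contains a
# heading and a script, but no quote rows until a browser runs the script.
PAGE = """
<table><tr><td>
  <div class="fixedpage_heading">CORN</div>
  <script>document.write(getQuoteboardHTML(/* ... */));</script>
</td></tr></table>
"""


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in the static markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def static_classes(html):
    """Return the set of class names a non-JavaScript parser can see."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


found = static_classes(PAGE)
print("quotefield_low" in found)  # False: the price cells were never generated
```

The same holds for BeautifulSoup: it sees the script element as text and nothing more.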
One solution is to use a real browser with the help of selenium:
from selenium import webdriver

url = "http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains"
driver = webdriver.Firefox()
driver.get(url)

# Locate the quote table that sits under the "CORN" heading.
# Note: Selenium 4 removed the find_element_by_* helpers; there, use
# driver.find_element(By.XPATH, ...) and row.find_element(By.CLASS_NAME, ...).
table = driver.find_element_by_xpath('//td[contains(div[@class="fixedpage_heading"], "CORN")]/table[@class="homepage_quoteboard"]')

# Skip the header row, then read one contract month per row
for row in table.find_elements_by_tag_name('tr')[1:]:
    month = row.find_element_by_class_name('quotefield_shortmonthonly').text
    low = row.find_element_by_class_name('quotefield_low').text
    high = row.find_element_by_class_name('quotefield_high').text
    print month, low, high

driver.close()
Prints:
SEP 329-0 338-0
DEC 335-6 345-4
MAR 348-2 358-0
MAY 356-6 366-0
JUL 364-0 373-4
SEP 372-0 379-4
DEC 382-0 390-2
MAR 392-4 399-0
MAY 400-0 405-0
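Another option is to "get down to the metal" and see what the splitQuote() and getQuoteboardHTML() js functions actually do. Using the browser developer tools, you can see that there is an underlying request to a URL that returns a piece of JavaScript code containing objects with the data for all of the tables on the page: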
var quotes = { 'ZC*1': { name: 'Corn', flag: 's', price_2_close: '338.75', open_interest: '2701', tradetime: '20140911133000', symbol: 'ZCU14', open: '338', high: '338', low: '329', last: '331.75', change: '-7', pctchange: '-2.07', volume: '1623', exchange: 'CBOT', type: '2', unitcode: '-1', date: '14104 ... ', month: 'May 2015', shortmonth: 'May 2015' } };
If you manage to extract the necessary parts from it, that would be your second option.
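A rough sketch of that extraction, in Python 3 with only the standard library, might look like the following. The function name parse_quotes and both regular expressions are my own illustration, not part of the site's code. The sample string is a trimmed copy of the first entry shown above, and the sketch assumes every value is a single-quoted string with no nested braces. Fetching the script is left out because the request URL is not given here.

```python
import re

# Matches one entry such as:  'ZC*1': { name: 'Corn', ... }
ENTRY_RE = re.compile(r"'([^']+)'\s*:\s*\{([^{}]*)\}")
# Matches one field inside an entry such as:  high: '338'
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_quotes(js_text):
    """Return {contract_key: {field: value}} from the `var quotes = {...}` text.

    Assumes flat entries whose values are all single-quoted strings.
    """
    quotes = {}
    for key, body in ENTRY_RE.findall(js_text):
        quotes[key] = dict(FIELD_RE.findall(body))
    return quotes


# Trimmed copy of the first entry from the response shown above
SAMPLE = ("var quotes = { 'ZC*1': { name: 'Corn', symbol: 'ZCU14', "
          "open: '338', high: '338', low: '329', last: '331.75' } };")

quotes = parse_quotes(SAMPLE)
for key, fields in quotes.items():
    print(key, fields["symbol"], fields["low"], fields["high"])
# prints: ZC*1 ZCU14 329 338
```

Regular expressions are used because the payload is JavaScript source with unquoted keys, not JSON, so json.loads would reject it as is.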