BeautifulSoup: scraping td and tr elements

Time: 2014-09-11 18:47:35

Tags: python html html-parsing beautifulsoup

I am trying to extract price data (high and low) from the third table (corn). The code returns "None":

import urllib2                          
from bs4 import BeautifulSoup           
import time                           
import re                               
start_urls = 4539                       
nb_quotes = 10                          
for urls in range (start_urls, start_urls - nb_quotes, -1):

    start_time = time.time()

    # construct the URLs strings
    url = 'http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains' 

    # Read the HTML page content
    page = urllib2.urlopen(url)

    # Create a beautifulsoup object
    soup = BeautifulSoup(page)

    # Search the table to be parsed in the whole HTML code
    tables = soup.findAll('table')
    tab = tables[2]                 # This is the table to be parsed   

    low_tmp = str(tab.findAll('tr')[0].findAll('td')[1].getText())     #Low price
    low = re.sub('[+]', '', low_tmp)                                
    high_tmp = str(tab.findAll('tr')[0].findAll('td')[2].string)    # High price
    high = re.sub('[+]', '', high_tmp)                             


    stop_time = time.time()


    print low, '\t', high, '(%0.1f s)' % (stop_time - start_time)

1 answer:

Answer 0 (score: 1)

The data in the table is filled in on the browser side by the following JavaScript call:

document.write(getQuoteboardHTML(
    splitQuote(quotes, 'ZC*1,ZC*2,ZC*3,ZC*4,ZC*5,ZC*6,ZC*7,ZC*8,ZC*9'.split(/,/)),
    'shortmonthonly,high,low,last,change'.split(/,/), { nospacers: true }));

BeautifulSoup is an HTML parser; it does not execute JavaScript.

Basically, you need something that executes this JavaScript for you.
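To see why the table cells never show up, here is a minimal sketch of what a static parser sees. It is written for Python 3 and uses the standard library's html.parser (one of the backends BeautifulSoup can be built on) so that it runs without extra packages. STATIC_HTML is a hypothetical, heavily simplified stand-in for the served page, not its real markup, and TagCollector is a helper name made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the served page: the table exists
# only as a string inside a script, as with the document.write(...) call.
STATIC_HTML = """
<html><body>
<div class="fixedpage_heading">CORN</div>
<script>
document.write('<table><tr><td>338-0</td></tr></table>');
</script>
</body></html>
"""


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser actually encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(STATIC_HTML)
collector.close()

# The script body is treated as plain text, so no <table>, <tr> or <td>
# element is ever created from it.
print(collector.tags)
print("td" in collector.tags)
```

The markup inside the script only becomes real elements once a JavaScript engine runs it, which is why the lookup on the third table has nothing to return.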

One solution is to use a real browser with the help of selenium:

from selenium import webdriver


url = "http://markets.iowafarmbureau.com/markets/fixed.php?page=egrains"

driver = webdriver.Firefox()
driver.get(url)

table = driver.find_element_by_xpath('//td[contains(div[@class="fixedpage_heading"], "CORN")]/table[@class="homepage_quoteboard"]')
for row in table.find_elements_by_tag_name('tr')[1:]:
    month = row.find_element_by_class_name('quotefield_shortmonthonly').text
    low = row.find_element_by_class_name('quotefield_low').text
    high = row.find_element_by_class_name('quotefield_high').text

    print month, low, high

driver.close()

Prints:

SEP 329-0 338-0
DEC 335-6 345-4
MAR 348-2 358-0
MAY 356-6 366-0
JUL 364-0 373-4
SEP 372-0 379-4
DEC 382-0 390-2
MAR 392-4 399-0
MAY 400-0 405-0

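Another option is to "get down to the bare metal" and look at what the splitQuote() and getQuoteboardHTML() JS functions actually do. Using the browser developer tools, you can see that there is an underlying request that returns a piece of JavaScript code containing a quotes object with the data for all the tables on the page: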

var quotes = { 'ZC*1': { name: 'Corn', flag: 's', price_2_close: '338.75', open_interest: '2701', tradetime: '20140911133000', symbol: 'ZCU14', open: '338', high: '338', low: '329', last: '331.75', change: '-7', pctchange: '-2.07', volume: '1623', exchange: 'CBOT', type: '2', unitcode: '-1', date: '14104 ... ', month: 'May 2015', shortmonth: 'May 2015' } };

If you manage to extract the necessary parts from it, that would be your second option.
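One possible way to do that extraction, sketched in Python 3 with only the re module: SAMPLE_JS is a trimmed sample holding just a few of the 'ZC*1' fields shown above, and parse_quotes is a hypothetical helper name. The sketch assumes every value is a single-quoted string without escaped quotes and that entries are not nested, which may not hold for the full response.

```python
import re

# Trimmed sample in the shape of the response shown above; only fields
# that appear in full there are included.
SAMPLE_JS = (
    "var quotes = { "
    "'ZC*1': { name: 'Corn', symbol: 'ZCU14', "
    "high: '338', low: '329', last: '331.75' } "
    "};"
)

# 'KEY': { ...body without nested braces... }
ENTRY_RE = re.compile(r"'([^']+)'\s*:\s*\{([^{}]*)\}")
# field: 'value'
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_quotes(js_text):
    """Return {symbol key: {field name: value}} from the quotes object."""
    quotes = {}
    for key, body in ENTRY_RE.findall(js_text):
        quotes[key] = dict(FIELD_RE.findall(body))
    return quotes


quotes = parse_quotes(SAMPLE_JS)
for key, fields in sorted(quotes.items()):
    print(key, fields.get("low"), fields.get("high"))
```

In practice js_text would be the body of the underlying request found in the developer tools. A regex is used here because the object literal has unquoted keys and single-quoted strings, so it is not valid JSON as it stands.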