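BeautifulSoup does not pick up all of the HTML correctly

Time: 2012-02-04 12:59:46

Tags: python beautifulsoup mechanize

I am trying to write a simple scraper for an academic project using BeautifulSoup and Mechanize in Python. I want to get the prices of some products from Amazon, because I want to test various theories about their pricing model. The problem I am running into is that BeautifulSoup randomly fails to pick up the whole HTML page from Mechanize. I have printed both versions to a text file whenever there is an error, and every time the Mechanize page is complete but the BeautifulSoup page is only half there. Here is my code: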

import traceback

import mechanize
from BeautifulSoup import BeautifulSoup

def process_product_url(product_url):
    """Scrapes and returns all the data in the given product url"""
    #Download product_page given product_url
    product_page_mech, product_page_bs = get_product_page_mech_bs(product_url)

    #Extract Price
    price = extract_price(product_page_bs)
    return price

def get_product_page_mech_bs(url):
    """Takes a product page url in str and returns the mech page and bs page"""
    while True:
        mech_page = get_mech_page(url)
        bs_page = BeautifulSoup(unicode(mech_page.response().read(), 'latin-1'))
        if not test_product_page(bs_page):
            #Log both versions of the bad page, then download it again
            log(unicode(bs_page))
            log(unicode(mech_page.response().read(), 'latin-1'))
            continue
        else:
            #Page passed the check, so stop retrying
            break
    return mech_page, bs_page

def test_product_page(product_page_bs):
    """Takes a BS product page and tests to see if proper"""
    #The page counts as proper only if the price span was parsed
    if product_page_bs.findAll('span', attrs={'id' : 'actualPriceValue'}) == []:
        return False
    else:
        return True

def get_mech_page(url):
    """Given a URL, returns Mechanize page object"""
    while True:
        try:
            br = initialize_browser()
            br.open(url)
            return br
        except Exception, e:
            print e
            #print_exc() writes the traceback itself and returns None
            traceback.print_exc()
            continue

def initialize_browser():
    """Returns a fully setup mechanize browser instance"""
    br = mechanize.Browser()
    br.addheaders = [("User-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0.1) Gecko/20100101 Firefox/9.0.1")]
    return br
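
The retry loops above run forever if a page never passes the check. A minimal sketch of a bounded retry follows, assuming the download and the validity check are passed in as callables. The names fetch_with_retries, fetch, is_valid, max_attempts and delay_seconds are hypothetical and not part of the original code.

```python
import time


def fetch_with_retries(fetch, is_valid, max_attempts=5, delay_seconds=1.0):
    """Call fetch() until is_valid(result) is true or attempts run out.

    fetch and is_valid are hypothetical callables supplied by the caller.
    Returns the first valid result, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
        except Exception as exc:  # a real scraper would catch narrower errors
            print("attempt %d raised %r" % (attempt, exc))
        else:
            if is_valid(result):
                return result
            print("attempt %d returned an invalid page" % attempt)
        # Pause between attempts, but not after the last one
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None
```

In the code above this could be called with a function that downloads and parses the page as fetch and with test_product_page as is_valid. A None result then means the page should be skipped instead of retried indefinitely.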

I have uploaded the BeautifulSoup output and the Mechanize output for this page: http://www.amazon.com/Fujifilm-X-Pro-Digital-Camera-Body/dp/B006UV6YMQ/ref=sr_1_2?s=electronics&ie=UTF8&qid=1328359488&sr=1-2 (I can't paste more than two links)

Edit: clarified and expanded

1 answer:

Answer 0 (score: 2)

I did this:

from BeautifulSoup import BeautifulSoup
import mechanize

def get_page_mech_bs(url):
    """Takes a page url and returns the mech page and bs page"""
    while True:
        mech_page = get_mech_page(url)
        bs_page = BeautifulSoup(unicode(mech_page.response().read(), 'latin-1'))
        if not test_page(bs_page):
            print "Error in page, redownloading"
            log(unicode(bs_page))
            log(unicode(mech_page.response().read(), 'latin-1'))
            continue
        else:
            break
    return mech_page, bs_page

def get_mech_page(url):
    br = mechanize.Browser()
    br.open(url)
    return br

def test_page(bs_page):
    return True

if __name__ == '__main__':
    print get_page_mech_bs("http://google.com")

I don't know how test_page is written. When test_page is True, I break out of the loop. The HTML I get from BeautifulSoup looks correct. What is the problem?
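
One way to narrow this down is to compare the raw page with the parsed page directly, rather than relying on a single price check. The following is only a sketch under assumptions: missing_markers and truncation_report are hypothetical helpers, and the marker strings are examples. It reports which substrings are present in the raw HTML but absent from the serialized soup, which shows whether content is really being lost during parsing.

```python
def missing_markers(raw_html, parsed_html, markers):
    """Return the markers that occur in raw_html but not in parsed_html."""
    return [m for m in markers if m in raw_html and m not in parsed_html]


def truncation_report(raw_html, parsed_html, markers):
    """Summarise how much of the raw page survived parsing."""
    return {
        "raw_length": len(raw_html),
        "parsed_length": len(parsed_html),
        "missing": missing_markers(raw_html, parsed_html, markers),
    }


if __name__ == "__main__":
    # Toy strings standing in for the Mechanize text and the soup text
    raw = "<html><body><span id='actualPriceValue'>$1</span></body></html>"
    parsed = "<html><body></body></html>"
    print(truncation_report(raw, parsed, ["actualPriceValue", "</html>"]))
```

With the code from the question, raw_html would be unicode(mech_page.response().read(), 'latin-1') and parsed_html would be unicode(bs_page). Plain text markers such as an id value are safer than whole tags, because the parser may rewrite quotes and whitespace when it serializes the page. If markers go missing only on the failing downloads, the loss happens in parsing rather than in the download.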