Scraping a web page with mechanize & BeautifulSoup returns only a subset of the actual result set

Asked: 2015-02-23 05:38:45

Tags: python web-scraping beautifulsoup mechanize

I am using mechanize & BeautifulSoup to scrape an HTML page that is essentially an HTML web query containing over 100,000 rows of data. My code below is able to scrape the data, but only a portion of it, 24,998 rows to be exact.

import mechanize, urllib2, cookielib
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# browser setup: cookie jar, ignore robots.txt, send a desktop User-Agent
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)

br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# open the report download URL and log in through the form it serves (Google account Email/Passwd)
br.open("https://www.google.com/doubleclick/search/reports/download?ay=20700000000000476&av=0&rid=28047&of=webqueryphtml")

br.select_form(nr=0)
br.form['Email'] = 'xxxxx@gmail.com'
br.form['Passwd'] = 'xxxxx'
br.submit()

# parse the full response body and open the pipe-delimited output file
soup = BeautifulSoup(br.response().read())

file = open('webquery_out.txt','w')

# walk every data row (skipping the header) and pull the eight columns
for row in soup.findAll('tr')[1:]:
    col = row.findAll('td')
    Row_Type = col[0].get_text(' ', True)
    Status = col[1].get_text(' ', True)
    Conv_ID = col[2].get_text(' ', True)
    Ad_Conv_ID = col[3].get_text(' ', True)
    Conv_Rev = col[4].get_text(' ', True)
    Conv_Org_Rev = col[5].get_text(' ', True)
    Engine = col[6].get_text(' ', True)
    Adv_ID = col[7].get_text(' ', True)
    # build one pipe-delimited record per table row (inside the loop)
    record = (Row_Type,Status,Conv_ID,Ad_Conv_ID,Conv_Rev,Conv_Org_Rev,Engine,Adv_ID)
    line = "|".join(record)
    file.write(line + '\n')

I don't know whether it is the user agent or the file writing that is causing the partial results. Any help is greatly appreciated.
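A rough diagnostic sketch, not a drop-in fix: it continues right after the br.submit() call above, reusing br and BeautifulSoup, and assumes the response body really does contain all the rows. The lxml and html5lib parsers are optional third-party packages that would have to be installed separately. The idea is to compare the raw row count against what each parser recovers, and to write through a context manager so no buffered rows are lost when the script exits.

html = br.response().read()

# 1. Compare the row count in the raw HTML with what each parser recovers;
#    a large gap suggests the parser is bailing out on malformed markup.
print('raw <tr> tags: %d' % html.count('<tr'))
for parser in ('html.parser', 'lxml', 'html5lib'):   # lxml / html5lib must be installed separately
    soup = BeautifulSoup(html, parser)
    print('%s: %d rows' % (parser, len(soup.findAll('tr'))))

# 2. Write through a context manager so every buffered row is flushed and the file is closed.
soup = BeautifulSoup(html, 'html5lib')   # the most lenient of the three parsers
with open('webquery_out.txt', 'w') as out:
    for row in soup.findAll('tr')[1:]:
        cols = [col.get_text(' ', True) for col in row.findAll('td')]
        out.write('|'.join(cols) + '\n')

If the raw count already matches the 24,998 rows you are seeing, the truncation happens upstream (in the query itself or in what mechanize receives), not in the parsing or the writing.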

0 Answers:

No answers yet.