I am using mechanize & BeautifulSoup to scrape an HTML page that is essentially an HTML webquery containing over 100,000 rows of data. My code below manages to scrape the data, but only part of it (24,998 rows, to be precise):
import mechanize, urllib2, cookielib
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open("https://www.google.com/doubleclick/search/reports/download?ay=20700000000000476&av=0&rid=28047&of=webqueryphtml")
br.select_form(nr=0)
br.form['Email'] = 'xxxxx@gmail.com'
br.form['Passwd'] = 'xxxxx'
br.submit()
soup = BeautifulSoup(br.response().read())
outfile = open('webquery_out.txt', 'w')
for row in soup.findAll('tr')[1:]:
    col = row.findAll('td')
    Row_Type = col[0].get_text(' ', True)
    Status = col[1].get_text(' ', True)
    Conv_ID = col[2].get_text(' ', True)
    Ad_Conv_ID = col[3].get_text(' ', True)
    Conv_Rev = col[4].get_text(' ', True)
    Conv_Org_Rev = col[5].get_text(' ', True)
    Engine = col[6].get_text(' ', True)
    Adv_ID = col[7].get_text(' ', True)
    record = (Row_Type, Status, Conv_ID, Ad_Conv_ID, Conv_Rev, Conv_Org_Rev, Engine, Adv_ID)
    line = "|".join(record)
    outfile.write(line + '\n')
I can't tell whether it is the user agent or the way I am writing the file that is causing the partial results. Any help is much appreciated.
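For what it's worth, here is a minimal, self-contained sketch of the same write loop under the buffering suspicion, using hypothetical stand-in records rather than the real scraped rows. It wraps the output file in a `with` block so the buffer is guaranteed to be flushed when the block exits:

```python
import os
import tempfile

# Hypothetical records standing in for the scraped rows.
records = [('Row', 'Status', str(i)) for i in range(1000)]

path = os.path.join(tempfile.mkdtemp(), 'webquery_out.txt')

# The 'with' block closes the file when it exits, which flushes any
# buffered writes; a bare open() that is never close()d can leave the
# tail of the output stuck in the interpreter's buffer.
with open(path, 'w') as out:
    for record in records:
        out.write('|'.join(record) + '\n')

with open(path) as check:
    lines = check.read().splitlines()
```

If the real script exits (or is interrupted) without the file ever being closed, the last buffered chunk of rows may never reach disk, which would look exactly like a truncated result set.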