My script scrapes multiple pages of a site by sending the page number as a POST parameter.
My function is:
import bs4
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def hit_url_and_scrape(headers, payload):
    print("hitting with {} as page number !!!".format(payload['page']))
    doc = requests.post('https://www.sci.gov.in/php/getPartyDetails.php', headers=headers, data=payload)
    print("I just got response for the {} th page number".format(payload['page']))
    return scrap(bs4.BeautifulSoup(doc.text, 'lxml'))
def main():
    end_page = 4
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'www.sci.gov.in',
        'Origin': 'https://www.sci.gov.in',
        'Referer': 'https://www.sci.gov.in/case-status',
        'X-Requested-With': 'XMLHttpRequest'
    }
    data = {
        'PartyType': '',
        'PartyName': party_name,   # defined elsewhere in my script
        'PartyYear': year,
        'PartyStatus': 'P',
        'page': page_count,
    }
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        results = []
        for page in range(2, int(end_page) + 1):
            data['page'] = page
            futures.append(executor.submit(hit_url_and_scrape, headers, data))
        for result in as_completed(futures):
            print(len(result.result()))
            results.extend(result.result())
            print("#####################################################################")
My print log is:
hitting with 2 as page number !!!
hitting with 3 as page number !!!
hitting with 4 as page number !!!
I just got response for the 4 th page number
48
I just got response for the 4 th page number
48
I just got response for the 4 th page number
48
As you can see in my log, the function receives the correct argument at submission time, but by the time each request is actually sent, every request carries the same parameters as the last Future object. The result length of every page equals that of the last page. I have tried my script on both Python 2.7 and 3.5.
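What I believe is happening: `executor.submit()` stores a reference to my `data` dict, not a snapshot, so mutating `data['page']` in the loop changes the payload of every already-queued task. Here is a minimal reproduction with the scraping stripped out (the names `read_page`/`submitted` are mine, not from my script); a `threading.Event` makes the workers read the dict only after the submit loop has finished, so the effect shows up deterministically:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

submitted = threading.Event()

def read_page(payload):
    submitted.wait()          # block until the submit loop below has finished
    return payload['page']

data = {'page': None}

# Shared dict: every queued task holds a reference to the SAME dict,
# so all of them observe the last value written by the loop.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = []
    for page in range(2, 5):
        data['page'] = page                       # mutates the one shared dict
        futures.append(executor.submit(read_page, data))
    submitted.set()
    shared = [f.result() for f in futures]
print(shared)   # [4, 4, 4] -- all tasks saw the final mutation

# Passing a copy per submission gives each task its own snapshot.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(read_page, dict(data, page=page))
               for page in range(2, 5)]
    copied = [f.result() for f in futures]
print(copied)   # [2, 3, 4]
```

So passing `dict(data, page=page)` (or `data.copy()` after setting the page) at each `submit()` call should make every request carry its own page number.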