My script scrapes multiple pages of a site by sending the page number as a POST parameter.
My function is:
import bs4
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def hit_url_and_scrape(headers, payload):
    print("hitting with {} as page number !!!".format(payload['page']))
    doc = requests.post('https://www.sci.gov.in/php/getPartyDetails.php', headers=headers, data=payload)
    print("I just got response for the {} th page number".format(payload['page']))
    return scrap(bs4.BeautifulSoup(doc.text, 'lxml'))
def main():
    end_page = 4
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'www.sci.gov.in',
        'Origin': 'https://www.sci.gov.in',
        'Referer': 'https://www.sci.gov.in/case-status',
        'X-Requested-With': 'XMLHttpRequest'
    }
    data = {
        'PartyType': '',
        'PartyName': party_name,   # defined elsewhere in my script
        'PartyYear': year,
        'PartyStatus': 'P',
        'page': page_count,
    }
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []
        results = []
        for page in range(2, int(end_page) + 1):
            data['page'] = page
            futures.append(executor.submit(hit_url_and_scrape, headers, data))
        for result in as_completed(futures):
            print(len(result.result()))
            results.extend(result.result())
            print("#####################################################################")
My print log is:
hitting with 2 as page number !!!
hitting with 3 as page number !!!
hitting with 4 as page number !!!
I just got response for the 4 th page number
48
I just got response for the 4 th page number
48
I just got response for the 4 th page number
48
As you can see in my log, the function receives the correct argument at submission time, but by the time each request is actually sent, every request carries the same parameters as the last Future object. The result length of every page equals that of the last page. I have tried my script on both Python 2.7 and 3.5.
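What I believe is happening: `executor.submit()` stores a reference to my `data` dict, not a snapshot, so mutating `data['page']` in the loop changes the payload of every already-queued task. Here is a minimal reproduction with the scraping stripped out (the names `read_page`/`submitted` are mine, not from my script); a `threading.Event` makes the workers read the dict only after the submit loop has finished, so the effect shows up deterministically:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

submitted = threading.Event()

def read_page(payload):
    submitted.wait()          # block until the submit loop below has finished
    return payload['page']

data = {'page': None}

# Shared dict: every queued task holds a reference to the SAME dict,
# so all of them observe the last value written by the loop.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = []
    for page in range(2, 5):
        data['page'] = page                       # mutates the one shared dict
        futures.append(executor.submit(read_page, data))
    submitted.set()
    shared = [f.result() for f in futures]
print(shared)   # [4, 4, 4] -- all tasks saw the final mutation

# Passing a copy per submission gives each task its own snapshot.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(read_page, dict(data, page=page))
               for page in range(2, 5)]
    copied = [f.result() for f in futures]
print(copied)   # [2, 3, 4]
```

So passing `dict(data, page=page)` (or `data.copy()` after setting the page) at each `submit()` call should make every request carry its own page number.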