Python requests.get()

Time: 2019-02-06 00:27:18

Tags: python-requests

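I am scraping political donation data. On one page there is a link that I first have to extract and then scrape in turn. I can get this secondary link just fine, but when I call requests.get() on it, the HTML that comes back is a 400 Bad Request error page.

I have tried changing the request by altering or adding headers, but nothing seems to help.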

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept - Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache - Control": "max - age = 0",
    "Connection": "keep-alive",
    "DNT": "1",
    "Host": "docquery.fec.gov",
    "Referer": "http://www.politicalmoneyline.com/tr/tr_MG_IndivDonor.aspx?tm=3",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

params = {
    "14960391627": ""
}

pdf_page = requests.get(potential_donor[10], headers=headers, params=params)
html = pdf_page.text
soup_donor_page = BeautifulSoup(html, 'html.parser')
print(soup_donor_page)

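Note: the site's URL should look like this: http://docquery.fec.gov/cgi-bin/fecimg/?14960391627 where the number at the end varies.

The output of print(soup_donor_page) is:

400 Bad request

    Your browser sent an invalid request.

I need the page's actual HTML so that I can get the embedded PDF from it.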

1 Answer:

Answer 0: (score: 0)

I suspect the issue arises because requests is being given a parameter that has no value.
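To illustrate why an empty-valued parameter might matter, here is a minimal sketch using only the standard library's urlencode. It assumes requests serializes params in a comparable way, which is worth confirming by printing pdf_page.url (the URL that was actually requested). The names base_url and image_id are illustrative and not part of the original code.

```python
from urllib.parse import urlencode

base_url = "http://docquery.fec.gov/cgi-bin/fecimg/"
image_id = "14960391627"  # example ID taken from the question

# A key with an empty value is encoded as "key=", so a trailing "=" appears.
encoded = urlencode({image_id: ""})
print(encoded)  # 14960391627=

# Placing the ID directly after "?" reproduces the URL form shown in the question.
direct_url = f"{base_url}?{image_id}"
print(direct_url)  # http://docquery.fec.gov/cgi-bin/fecimg/?14960391627
```

If the server only accepts the bare form without the trailing "=", that difference alone could plausibly explain the 400 response.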

Try building the URL with a format string instead:

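import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent is the only header sent here.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

# Put the ID straight into the query string instead of passing it via params.
param = "14960391627"
r = requests.get(f"http://docquery.fec.gov/cgi-bin/fecimg/?{param}", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# The PDF location is the src attribute of the page's <embed> tag.
print(soup.find("embed")["src"])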

Result:

http://docquery.fec.gov/pdf/859/14960388859/14960388859_002769.pdf#zoom=fit&navpanes=0
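As a possible next step toward fetching the PDF itself, the sketch below separates the "#zoom=fit&navpanes=0" fragment from the printed URL and derives a file name from the path. It uses only the standard library. The variable names and the idea of naming the local file after the last path segment are assumptions for illustration, not something the original answer prescribes.

```python
import posixpath
from urllib.parse import urldefrag, urlparse

# The src value printed above, copied here as a literal for the example.
embed_src = "http://docquery.fec.gov/pdf/859/14960388859/14960388859_002769.pdf#zoom=fit&navpanes=0"

# Split off the "#..." fragment; it holds viewer hints, not part of the resource path.
pdf_url, fragment = urldefrag(embed_src)

# Use the last path segment as a local file name (hypothetical naming choice).
file_name = posixpath.basename(urlparse(pdf_url).path)

print(pdf_url)    # http://docquery.fec.gov/pdf/859/14960388859/14960388859_002769.pdf
print(file_name)  # 14960388859_002769.pdf
```

The cleaned pdf_url could then be passed to requests.get(), and the response's content written to file_name in binary mode.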