Error when using the requests module in Python

Time: 2018-07-25 14:26:34

Tags: python python-requests

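I have recently been trying to build a web scraper with the requests module in Python.

At first it was working, then I got a Response 403 error, and when I went back and tested the output for the sites I had already scraped, I got a Response 200 error. I was wondering if anyone knows why this is happening.

In the code below, I get Response 200 for collect_omers and collect_real, and then Response 403 for collect_bdc. Thanks.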

import requests,bs4

def collect_omers():
    acquired_list = []
    logo_list = []
    omers_html = requests.get('https://www.omersventures.com/portfolio-summary')
    print(omers_html)
    omers_soup = bs4.BeautifulSoup(omers_html.text,"html.parser")
    omers_tags = omers_soup.select('.field-content a')
    for logo in omers_tags:
        if "portfolio" in str(logo) and logo.get_text() != "":
            if "acquired" in logo.get_text().lower():
                acquired_list.append(logo.get_text())
            else:
                logo_list.append(logo.get_text())

def collect_real():
    acquired_list = []
    logo_list = []
    real_html = requests.get('https://realventures.com/backing/')
    print(real_html)
    real_soup = bs4.BeautifulSoup(real_html.text,"html.parser")
    real_tags = real_soup.select('.company-list__grid-item')
    count = 1
    for logo in real_tags:
        listed = logo.get_text().strip().split("\n")
        if len(listed)>3:
            acquired_list.append(listed[0].strip() + " " + "(" + listed[3] + ")")
        else:
            logo_list.append(listed[0].strip())

def collect_bdc():
    acquired_list = []
    logo_list = []
    bdc_html = requests.get('https://www.inovia.vc/portfolio/')
    print(bdc_html)
    bdc_soup = bs4.BeautifulSoup(bdc_html.text,"html.parser")

    bdc_tags = bdc_soup.select('.row')
    count = 1
    for logo in bdc_tags:
        print(logo.get_text())


collect_real()

1 Answer:

Answer 0 (score: 0)

A 200 response is fine: it means your request went through and a response was returned.
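To make the difference between the two codes concrete, here is a minimal sketch that uses only the standard library. `describe_status` is a hypothetical helper written for this illustration, not part of requests. With requests itself, the numeric code is available as `response.status_code`, and printing the response object shows it as `<Response [200]>`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code.

    Hypothetical helper for illustration only.
    """
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "OK"
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(403))  # 403 Forbidden (client error)
```

So only the 403 needs fixing. The 200 responses mean those pages were fetched successfully.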

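The 403 response from the third site does indicate that something went wrong. Looking at it, the third site appears to automatically reject GET requests that do not supply a browser-like User-Agent header (by default, requests identifies itself as python-requests/<version>). You can find your own user-agent string by pressing F12 in Chrome, clicking the "Network" tab, navigating to the site, and then clicking the corresponding request in the list. The User-Agent header is listed under the "Request Headers" section. You then need to pass this header to requests.get() through its headers keyword argument.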

The code would look like this:

    bdc_html = requests.get('https://www.inovia.vc/portfolio/', headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'})
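If several sites need the same header, one possible approach is to set it once on a `requests.Session` so that every request made through that session carries it. The sketch below is an illustration rather than a definitive fix. `make_session` and `fetch_html` are hypothetical helper names, the 10-second timeout is an arbitrary assumption, and the user-agent string is simply the one used in the answer's code. There is also no guarantee that the site still accepts this header.

```python
import requests

# Assumption: the user-agent string from the answer; any current browser
# string copied from your own "Request Headers" panel should work the same way.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36"
)


def make_session(user_agent=USER_AGENT):
    """Create a session whose requests all send the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and return its text, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    # raise_for_status() turns a 403 into a requests.HTTPError instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(make_session(), 'https://www.inovia.vc/portfolio/')` would then return the page text for BeautifulSoup to parse, or raise an exception if the site still answers with 403.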