Question

Scraping data

import requests
from bs4 import BeautifulSoup

res = requests.get('https://angel.co/pen-io')
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.prettify())

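This prints the title tag as "Page not found - 404 - AngelList". In a web browser the site works fine, but its source differs from the output of my Python script. I also tried selenium with phantomjs, and it shows the same thing.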

Answer 1

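It looks like angel.co decides whether to reply with an HTTP 404 based on the User-Agent that is sent, and it appears to block the default requests agent (possibly depending on the version). This is probably meant to deter bot activity.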
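To see which User-Agent string requests sends when no header is supplied, the library exposes a helper. A minimal offline sketch; the version suffix it prints depends on the installed release:

```python
import requests

# The User-Agent that requests sends when no header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/<installed version>

# A new Session carries the same value in its default headers.
session = requests.Session()
print(session.headers["User-Agent"] == default_ua)
```

A server that filters on this header can recognise the `python-requests/` prefix, which would be consistent with the 404 seen here.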

Some output from my ipython session follows. I am using requests/2.17.3.

Using the default Python requests User-Agent

In [37]: rsp = requests.get('https://angel.co/bloom')
In [38]: rsp.status_code
Out[38]: 404

Using a Mozilla-compatible User-Agent

In [39]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'Mozilla/5.0'})

In [40]: rsp.status_code
Out[40]: 200

rsp.content contains the content you would expect to see from angel.co/bloom.

Using some random User-Agent

In [41]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'birryree angel scraper'})

In [42]: rsp.status_code
Out[42]: 200

So you should set a User-Agent to get past whatever filtering or blocking angel.co applies to the various default agents.

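If you are going to do a lot of scraping, I suggest you be a good citizen and set a User-Agent string that lets them contact you in case your scraping causes problems, for example: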

requests.get('https://angel.co/bloom',
             headers={'User-Agent': 'Mozilla/5.0 (compatible; http://yoursite.com)'})
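When many pages are fetched, setting the header once on a `requests.Session` avoids repeating it on every call. This is only a sketch under assumptions: the scraper name, the contact URL, the one-second delay, and the `fetch` helper are placeholders, not anything angel.co is known to require.

```python
import time
import requests

# Hypothetical identifying string; replace the URL with a real contact page.
USER_AGENT = "Mozilla/5.0 (compatible; my-scraper; +http://yoursite.com)"

session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})  # sent with every request

def fetch(url, delay=1.0):
    """Fetch one page politely: pause first, then raise on 4xx/5xx replies."""
    time.sleep(delay)
    rsp = session.get(url, timeout=10)
    rsp.raise_for_status()  # a 404 becomes an exception instead of bad HTML
    return rsp.text
```

Calling `fetch('https://angel.co/bloom')` would then return the page text, or raise `requests.HTTPError` if the site still answers with an error status.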

Answer 2

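Add headers to the requests call and you will be able to access the page. Below are the results from the "People also viewed" section. Try the following script: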

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent; the default requests agent gets a 404.
res = requests.get('https://angel.co/pen-io', headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select(".text"):
    try:
        title = item.select_one("a.startup-link").get_text()
    except AttributeError:  # select_one returned None: no matching link
        title = ''
    print(title)

Result:

Corilla
Pronoun
checkthis
Wattpad
Medium
Plympton
Cheezburger
AngelList
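The `try`/`except` in the script is there because `select_one` returns `None` when a `.text` block has no `a.startup-link` inside it. A minimal offline sketch of the same loop with an explicit `None` check; the HTML snippet is invented for illustration and is not angel.co's actual markup:

```python
from bs4 import BeautifulSoup

# Invented sample markup, not copied from angel.co.
html = """
<div class="text"><a class="startup-link" href="#">Corilla</a></div>
<div class="text"><span>no link here</span></div>
<div class="text"><a class="startup-link" href="#">Medium</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = []
for item in soup.select(".text"):
    link = item.select_one("a.startup-link")  # None when nothing matches
    titles.append(link.get_text(strip=True) if link is not None else "")

print(titles)
```

The block without a link contributes an empty string, just as the script does for such items.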

I am trying to web scrape http://angel.co/bloomfire
