Web scraping with BeautifulSoup

Date: 2019-01-14 16:58:33

Tags: python beautifulsoup

The following error occurs. The code should return the company name, the idea being to look for the following tag:

<span datatype="xsd:string" property="gazorg:name">ISCA SCAFFOLD LIMITED </span>

using the following code:

import requests
from bs4 import BeautifulSoup
data = requests.get('https://www.thegazette.co.uk/notice/3188283')
data.text[:1000]
soup = BeautifulSoup(data.text, 'html.parser')
soup.prettify()[:1000]
span = soup.find('span', {'property' : 'gazorg:name'})
company = span.text

Error:

AttributeError                            Traceback (most recent call last)
<ipython-input-7-4449f0e20d72> in <module>
----> 1 company = span.text
AttributeError: 'NoneType' object has no attribute 'text'

1 answer:

Answer 0 (score: 1)

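The error occurs because no browser-like User-Agent is set (requests sends its own default, python-requests/<version>, unless you override it). A site can respond differently depending on the User-Agent, and some sites do not return a valid page when it is missing or unrecognized. Here the response did not contain the expected span, so soup.find returned None and span.text raised the AttributeError.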
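To see which User-Agent requests would send, the header can be inspected on a prepared request without contacting the site. The sketch below is only an illustration: the helper name prepared_user_agent is hypothetical, and the browser string is just one example of a browser-like value.

```python
import requests

URL = "https://www.thegazette.co.uk/notice/3188283"

# Example browser-like value (an assumption; any real browser string would do).
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
)


def prepared_user_agent(session):
    """Return the User-Agent the session would send; no network call is made."""
    prepared = session.prepare_request(requests.Request("GET", URL))
    return prepared.headers["User-Agent"]


# A fresh session carries the library default, e.g. "python-requests/<version>".
default_ua = prepared_user_agent(requests.Session())

# Overriding the session header replaces that default for every request.
browser_session = requests.Session()
browser_session.headers.update({"User-Agent": BROWSER_UA})
browser_ua = prepared_user_agent(browser_session)

print(default_ua)
print(browser_ua)
```

Setting the header on a Session rather than on each call is a design choice that keeps it consistent if several pages are fetched.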

It is advisable to set the User-Agent to something similar to the one used by the browser you inspected the site with.

import requests
from bs4 import BeautifulSoup
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
data = requests.get('https://www.thegazette.co.uk/notice/3188283',headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
span = soup.find('span', {'property' : 'gazorg:name'})
company = span.text
print(company)

Output

ISCA SCAFFOLD LIMITED
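
Because soup.find returns None when nothing matches, a more defensive variant can check for that before reading the text. This is a minimal sketch, not part of the original answer: the function name extract_company_name and the "Access denied" page are hypothetical stand-ins, and the HTML is inlined so the example runs without a network request.

```python
from bs4 import BeautifulSoup


def extract_company_name(html):
    """Return the text of the gazorg:name span, or None if the span is absent."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", {"property": "gazorg:name"})
    if span is None:
        # The page did not contain the tag (for example, a blocked response).
        return None
    return span.get_text(strip=True)


# The tag from the question, inlined for illustration.
found = extract_company_name(
    '<span datatype="xsd:string" property="gazorg:name">ISCA SCAFFOLD LIMITED </span>'
)

# Hypothetical stand-in for a response that lacks the tag.
missing = extract_company_name("<html><body><p>Access denied</p></body></html>")

print(found)
print(missing)
```

With a check like this, a response that lacks the tag yields None instead of an AttributeError, which makes it easier to tell a blocked request from a parsing mistake.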