How to scrape text from a <p> element by its id

Time: 2018-07-06 10:29:39

Tags: python web-scraping beautifulsoup nonetype

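I am learning how to scrape, so I am not very advanced yet. I am scraping company descriptions from Bloomberg, for example from this page (https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=320105).

I want to scrape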

<p id="bDescTeaser" itemprop="description">Fiat Chrysler Automobiles N.V., ...</p>


My script:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
html= 'https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=32010'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
text = data.find('p',id="bDescTeaser",itemprop="  ")
print(text.get_text))

When I run the program, I get:

AttributeError: 'NoneType' object has no attribute 'get_text'

Is this a problem with my code, or with this specific webpage?

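2 Answers:

Answer 0 (score: 1)

With your code, Bloomberg blocks the request because it takes you for a bot, so the page you get back does not contain the element and find() returns None. Use the requests library and send a User-Agent header; you should then get the expected output.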

import requests
from bs4 import BeautifulSoup

header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'}
request = requests.get('https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=320105',headers=header)
soup = BeautifulSoup(request.text, 'html.parser')    
text = soup.find('p',id="bDescTeaser")
print(text.get_text())
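
If the site still blocks the request, or the page layout changes, find() will return None again and the same AttributeError comes back. Below is a minimal sketch of a more defensive version. The helper names extract_description and fetch_description, the 10-second timeout, and the strip=True argument are my own choices for illustration and are not part of the original code; the URL and User-Agent string are the ones used above.

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup

# URL and User-Agent are copied from the answer above; everything else is illustrative.
URL = 'https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=320105'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'}


def extract_description(html: str) -> Optional[str]:
    """Return the text of <p id="bDescTeaser">, or None if the element is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('p', id='bDescTeaser')
    # find() returns None when nothing matches, for example on a block page.
    if tag is None:
        return None
    return tag.get_text(strip=True)


def fetch_description(url: str = URL) -> Optional[str]:
    """Download the page with a browser-like User-Agent and extract the description."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return extract_description(response.text)
```

Calling fetch_description() then gives you either the description text or None, so you can print a message such as "element not found, the request may have been blocked" instead of crashing on None.get_text().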

Answer 1 (score: 0)

You are missing the opening parenthesis after get_text. Change get_text) to get_text().