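I am trying to scrape data from a news website, and I need the text inside the p tags.
I searched Google extensively, but every solution I found either returns "None" or raises this error: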
    Traceback (most recent call last):
      File "E:/Python/News Uploader to Google Driver/venv/Scripts/main.py", line 41, in <module>
        contents = parse(text)
      File "E:/Python/News Uploader to Google Driver/venv/Scripts/main.py", line 28, in parse
        article = soup.find("div", {"class": "content_text row description"}).findAll('p')
    AttributeError: 'NoneType' object has no attribute 'findAll'
    def parse(url):
        html = requests.get(url)
        #array_of_paragraphs = [""]
        soup = BeautifulSoup(html.content, 'html5lib')
        text = []
        text = soup.find("div", {"class": "content_text row description"}).findAll('p')
        for t in text:
            text = ''.join(element.findAll(text=True))
        return text
You can use this for testing purposes.
Nothing is displayed on the console other than the "None" message or the error.
Answer 0 (score: 0)
Try this simplified version:
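    import requests
    from bs4 import BeautifulSoup

    def parse(url):
        html = requests.get(url)
        soup = BeautifulSoup(html.content, 'html5lib')
        final_text = []
        # Locate the target div by its full class attribute value
        inf = soup.find('div', class_="content_text row description")
        # Collect every <p> element inside that div
        info = inf.find_all('p')
        for i in info:
            final_text.append(i.text)
        return final_text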
The output is everything between the <p> tags (inside the target div).
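The AttributeError in the question arises because `find` returns `None` when nothing matches, and `None` has no `findAll` method. Below is a minimal sketch of guarding against that case. The function name `extract_paragraphs` and the sample HTML are hypothetical and not taken from the real page, and the built-in `html.parser` is assumed in place of `html5lib` so the snippet runs offline without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """
<div class="content_text row description">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

def extract_paragraphs(html, parser="html.parser"):
    """Return the text of each <p> in the target div, or [] if the div is missing."""
    soup = BeautifulSoup(html, parser)
    container = soup.find("div", class_="content_text row description")
    if container is None:
        # find() returned None: the div is not present in the HTML we received.
        return []
    return [p.get_text(strip=True) for p in container.find_all("p")]

print(extract_paragraphs(SAMPLE_HTML))
print(extract_paragraphs("<html><body></body></html>"))
```

If the guarded version returns an empty list for the live page, the div is probably absent from the HTML that `requests` received, so inspecting `html.text` is a reasonable next step.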
Answer 1 (score: 0)
Select the child p elements of the parent identified by its class:
    import requests
    from bs4 import BeautifulSoup as bs

    headers = {'User-Agent':'Mozilla/5.0'}
    r = requests.get('https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319', headers = headers)
    soup = bs(r.content, 'lxml')
    print('\n'.join([i.text for i in soup.select('.description p')]))
Or, wrapped in a function:

    import requests
    from bs4 import BeautifulSoup as bs

    def parse(url):
        headers = {'User-Agent':'Mozilla/5.0'}
        r = requests.get(url, headers = headers)
        soup = bs(r.content, 'lxml')
        # Join the text of every <p> under an element with class "description"
        text = '\n'.join([i.text for i in soup.select('.description p')])
        return text

    parse('https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319')
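One reason the selector approach can be more forgiving: a CSS class selector such as `.description` matches any element whose class list contains `description`, whereas passing a multi-word string to `class_` matches only that exact attribute value, in that order. The sketch below illustrates the difference on hypothetical HTML (the class order and the sidebar div are assumptions for illustration, not the real page), again using the built-in `html.parser` so it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes as in the question, but in a different order.
SAMPLE_HTML = """
<div class="row description content_text">
  <p>Inside the target div.</p>
</div>
<div class="sidebar">
  <p>Outside the target div.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: fails because the order differs.
exact = soup.find("div", class_="content_text row description")

# CSS selector: matches any div that has "description" among its classes.
by_selector = [p.get_text() for p in soup.select("div.description p")]

print(exact)
print(by_selector)
```

Here `exact` is `None`, while the selector still finds the paragraph inside the target div and skips the one in the sidebar.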