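I want to extract the headlines in the Most Read section of a news page. This is what I have so far, but it returns every headline on the page. I only want the ones in the Most Read section.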
`
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.michigandaily.com/section/opinion'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html5lib")
for story_heading in soup.find_all(class_="views-field views-field-title"):
    if story_heading.a:
        print(story_heading.a.text.replace("\n", " ").strip())
    else:
        print(story_heading.contents[0].strip())`
Answer 0 (score: 1)
You need to restrict the search to the div container that holds the most-read articles.
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.michigandaily.com/section/opinion'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html5lib")
most_read_soup = soup.find_all('div', {'class': 'view-id-most_read'})[0]
for story_heading in most_read_soup.find_all(class_="views-field views-field-title"):
    if story_heading.a:
        print(story_heading.a.text.replace("\n", " ").strip())
    else:
        print(story_heading.contents[0].strip())
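If the page layout changes, `find_all(...)[0]` raises an `IndexError` when the container is missing. Below is a minimal sketch of the same scoping idea with a guard for that case. It runs against a small inline HTML sample rather than the live page; the sample markup, the `most_read_titles` helper name, and the use of the built-in `html.parser` are assumptions made for illustration, not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the page structure (an assumption).
SAMPLE_HTML = """
<div class="view view-id-latest">
  <div class="views-field views-field-title"><a href="/a">Latest story</a></div>
</div>
<div class="view view-id-most_read">
  <div class="views-field views-field-title"><a href="/b">Popular story one</a></div>
  <div class="views-field views-field-title">Popular story two</div>
</div>
"""


def most_read_titles(html, container_class="view-id-most_read"):
    """Return the headline texts found inside the Most Read container."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns None instead of raising when nothing matches.
    container = soup.find("div", class_=container_class)
    if container is None:
        return []
    titles = []
    for field in container.find_all(class_="views-field-title"):
        # Collapse newlines and repeated whitespace into single spaces.
        titles.append(" ".join(field.get_text().split()))
    return titles


print(most_read_titles(SAMPLE_HTML))
```

Passing `r.text` from the request above in place of `SAMPLE_HTML` should apply the same logic to the real page, provided the class names still match.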
Answer 1 (score: 0)
You can use a CSS selector to pull the specific tags out of the Most Read div:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.michigandaily.com/section/opinion'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html5lib")
css = "div.pane-most-read-panel-pane-1 a"
links = [a.text.strip() for a in soup.select(css)]
Which will give you:
[u'Michigan in Color: Anotha One', u'Migos trio ends 2016 SpringFest with Hill Auditorium concert', u'Migos dabs their way to a seminal moment for Ann Arbor hip hop', u'Best of Ann Arbor 2016', u'Best of Ann Arbor 2016: Full List']
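The selector above matches every link inside the pane, so any non-headline link in that pane would be included too. As a rough sketch, the selector can be narrowed to the title fields and the link targets collected alongside the text. The sample markup, the combined selector, and the `most_read_links` helper are hypothetical and only illustrate the approach; whether the real pane nests `views-field-title` elements this way is an assumption.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.michigandaily.com"

# Hypothetical markup for illustration only (an assumption).
SAMPLE_HTML = """
<div class="panel-pane pane-most-read-panel-pane-1">
  <h2><a href="/most-read">Most Read</a></h2>
  <div class="views-field views-field-title"><a href="/story/1">  First story </a></div>
  <div class="views-field views-field-title"><a href="/story/2">Second story</a></div>
</div>
"""


def most_read_links(html, base=BASE):
    """Return (title, absolute URL) pairs for headline links in the pane."""
    soup = BeautifulSoup(html, "html.parser")
    # Only anchors that sit inside a title field and carry an href.
    css = "div.pane-most-read-panel-pane-1 .views-field-title a[href]"
    return [
        (a.get_text(strip=True), urljoin(base, a["href"]))
        for a in soup.select(css)
    ]


for title, url in most_read_links(SAMPLE_HTML):
    print(title, url)
```

The heading link in the sample is skipped because it is not inside a title field, which is the point of the narrower selector.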