我想同时从两个<p>
抓取文字,我该如何获取?
对于第一个<p>
,我的代码可以正常工作,但是我无法获取第二个<p>
。
<p>
<a href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
Emerging online threats changing Homeland Security's role from merely fighting terrorism
</a>
</p>
</hgroup>
</header>
<p>
Homeland Security Secretary Kirstjen Nielsen said Monday that her department may have been founded to combat terrorism, but its mission is shifting to also confront emerging online threats.
China, Iran and other countries are mimicking the approach that Russia used to interfere in the U.S. ...
<a class="more_link" href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
<span class="icon-arrow-2">
</span>
</a>
</p>
我的代码是:
from bs4 import BeautifulSoup
ssl._create_default_https_context = ssl._create_unverified_context
article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = urllib.request.urlopen(article)
soup = BeautifulSoup(page, 'html.parser')
article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_= "right date")
date = date.text
headline = article.p.find('a')
headline = headline.text
content = article.p.text
print(date, headline,content)
答案 0 :(得分:1)
使用父id和p选择器,并在返回列表中索引所需的段落数。您可以在发布时使用时间标签
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/#.XJIQNDj7TX4')
soup = bs(r.content, 'lxml')
posted = soup.select_one('time').text
print(posted)
paras = [item.text.strip() for item in soup.select('#jtarticle p')]
print(paras[:2])
答案 1 :(得分:0)
您可以使用.find_next()
。但是,这不是全文:
from bs4 import BeautifulSoup
import requests
article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = requests.get(article)
soup = BeautifulSoup(page.text, 'html.parser')
article = soup.find('div', class_="content_col")
date = article.h3.find('span', class_= "right date")
date_text = date.text
headline = article.p.find('a')
headline_text = headline.text
content_text = article.p.find_next('p').text
print(date_text, headline_text ,content_text)