Question

I want to scrape the text from two <p> elements at once. How can I get both? My code works for the first <p>, but I can't get the second <p>.

  <p>
        <a href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
         Emerging online threats changing Homeland Security's role from merely fighting terrorism
        </a>
       </p>
      </hgroup>
     </header>
     <p>
      Homeland Security Secretary Kirstjen Nielsen said Monday that her department may have been founded to combat terrorism, but its mission is shifting to also confront emerging online threats.

    China, Iran and other countries are mimicking the approach that Russia used to interfere in the U.S. ...
      <a class="more_link" href="https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/">
       <span class="icon-arrow-2">
       </span>
      </a>
     </p>

My code is:

    import ssl
    import urllib.request

    from bs4 import BeautifulSoup

    # Disables HTTPS certificate verification (insecure)
    ssl._create_default_https_context = ssl._create_unverified_context
    article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
    page = urllib.request.urlopen(article)
    soup = BeautifulSoup(page, 'html.parser')
    article = soup.find('div', class_="content_col")
    date = article.h3.find('span', class_= "right date")
    date = date.text
    headline = article.p.find('a')
    headline = headline.text
    # article.p is only the first <p> (the headline), so this misses the second <p>
    content = article.p.text
    print(date, headline,content)

Answer 1

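Use the parent id together with a p selector, then index into the returned list to get as many paragraphs as you need. For the posted date, you can use the time tag.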

import requests 
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.japantimes.co.jp/news/2019/03/19/world/crime-legal-world/emerging-online-threats-changing-homeland-securitys-role-merely-fighting-terrorism/#.XJIQNDj7TX4')
soup = bs(r.content, 'lxml')
posted = soup.select_one('time').text
print(posted)
paras = [item.text.strip() for item in soup.select('#jtarticle p')]
print(paras[:2])
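
To try the selector approach without a network request, here is a minimal sketch that runs against inline markup. The `SAMPLE_HTML` string and the `first_paragraphs` helper are hypothetical stand-ins for illustration; only the `jtarticle` id is taken from the answer above, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real article page may be structured differently.
SAMPLE_HTML = """
<div id="jtarticle">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</div>
<p>Outside the container.</p>
"""


def first_paragraphs(html, count=2):
    """Return the text of the first `count` <p> elements under #jtarticle."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns a list of every <p> that is a descendant of #jtarticle
    paras = [p.get_text(strip=True) for p in soup.select("#jtarticle p")]
    return paras[:count]


print(first_paragraphs(SAMPLE_HTML))
```

Because `select()` returns a plain list, slicing it is enough to pick any number of paragraphs, and the `<p>` outside the container is never matched.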

Answer 2

You can use .find_next(). Note, however, that this does not return the full article text:

from bs4 import BeautifulSoup
import requests


article = "https://www.japantimes.co.jp/tag/cybersecurity/page/1/"
page = requests.get(article)
soup = BeautifulSoup(page.text, 'html.parser')


article = soup.find('div', class_="content_col")

date = article.h3.find('span', class_= "right date")
date_text = date.text

headline = article.p.find('a')
headline_text = headline.text

content_text = article.p.find_next('p').text
print(date_text, headline_text, content_text)
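
As an offline illustration of why `.find_next()` reaches the second `<p>`, here is a sketch against inline markup modeled on the snippet in the question. The `SAMPLE_HTML` string, its placeholder URL, and its text are assumptions made for the example, not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's snippet: the headline <p> sits
# inside <header>/<hgroup>, and the summary <p> follows the header.
SAMPLE_HTML = """
<div class="content_col">
  <header>
    <hgroup>
      <p><a href="https://example.com/story">Headline text</a></p>
    </hgroup>
  </header>
  <p>
    Summary text ...
    <a class="more_link" href="https://example.com/story"><span class="icon-arrow-2"></span></a>
  </p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article = soup.find("div", class_="content_col")

# article.p is the first <p> in document order, i.e. the headline
headline_p = article.p
# find_next() keeps walking forward in document order, so it is not limited
# to siblings and reaches the <p> that follows the <header>
summary_p = headline_p.find_next("p")

headline_text = headline_p.get_text(strip=True)
summary_text = summary_p.get_text(strip=True)

# Alternative: collect the first two <p> elements in a single call
both = [p.get_text(strip=True) for p in article.find_all("p", limit=2)]

print(headline_text)
print(summary_text)
```

`find_all("p", limit=2)` may be the simpler choice when both paragraphs are always wanted, while `find_next()` fits when you already hold the first element.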

How do I access the text of two <p> elements using beautifulsoup4?
