Question

I'm trying to scrape the body of this article: https://www.cnbc.com/2017/12/07/pinterest-hires-former-facebook-exec-gary-johnson-to-run-corporate-dev.html

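Normally this is a simple find_all('p'), but I'm having trouble avoiding the <a> tags nested inside some of the <p> tags. This happens, for example, when a word in the body is hyperlinked to another URL.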

I want to get the text without the URLs. So far I have:

from bs4 import BeautifulSoup
import requests

html = requests.get("https://www.cnbc.com/2017/12/07/pinterest-hires-former-facebook-exec-gary-johnson-to-run-corporate-dev.html").text
soup = BeautifulSoup(html, 'html5lib')

all_paragraphs = soup.find_all('p')

How can I extract the text from every <p> without the <a> tags and their URLs?

Thanks in advance.

Answer 1

To get all the text inside each p (including the text of any a inside it) without the tags themselves, use .text or .get_text():

from bs4 import BeautifulSoup
import requests

html = requests.get("https://www.cnbc.com/2017/12/07/pinterest-hires-former-facebook-exec-gary-johnson-to-run-corporate-dev.html").text
soup = BeautifulSoup(html, 'html5lib')

all_paragraphs = soup.find_all('p')

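for p in all_paragraphs:
    # print(p) would show the raw HTML, tags included
    print(p.get_text())  # p.get_text(strip=True) strips whitespace from each text fragment
    # p.text is a shorthand for p.get_text()
    print(p.text)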
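
To check this without hitting the network, here is a minimal offline sketch. The sample_html string and the example.com link are made up for illustration, and it uses the built-in "html.parser" instead of html5lib only to avoid the extra dependency. It suggests that get_text() already drops the href and keeps just the link's visible text:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for one paragraph of the article.
sample_html = (
    '<p>Pinterest hired <a href="https://example.com/gary">Gary Johnson</a> '
    'to run corporate development.</p>'
)

soup = BeautifulSoup(sample_html, "html.parser")
paragraph = soup.find("p")

# get_text() joins the text nodes only: the link text stays, the URL does not.
text = paragraph.get_text()
print(text)
```

If the goal is only to avoid URLs in the output, this may already be enough.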

If you want the text in p without the text of a, you have to remove each a before getting the text:

for p in all_paragraphs:
    # extract() detaches each <a> (and its text) from the tree, so this modifies soup
    for a in p.find_all('a'):
        a.extract()
    print(p.text)
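
The same idea as a self-contained sketch that runs offline. The helper name text_without_links and the sample HTML are hypothetical, and it swaps in decompose() (which removes and destroys the tag) where extract() would work just as well. Removing a link in the middle of a sentence leaves a double space behind, so the sketch also collapses whitespace, which is an extra step you may or may not want:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for one paragraph of the article.
sample_html = (
    '<p>Pinterest hired <a href="https://example.com/gary">Gary Johnson</a> '
    'to run corporate development.</p>'
)

def text_without_links(html):
    """Return the text of every <p>, with <a> tags and their text removed."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for p in soup.find_all("p"):
        for a in p.find_all("a"):
            a.decompose()  # drop the link and its text from the tree
        # Collapse the gap left where the link used to be.
        results.append(" ".join(p.get_text().split()))
    return results

cleaned = text_without_links(sample_html)
print(cleaned)
```

Because the helper parses its own copy of the HTML, the caller's soup (if any) is left untouched.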
