Question

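I am scraping multiple web pages, but I ran into a problem with sites where some of the content/text sits in div tags rather than p or span tags. Previously, the script extracted text from p and span tags just fine, but if the snippet looks like this: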

<div>Hello<p>this is a test</p></div>

then using find_all('div') together with .getText() gives the following output:

Hello this is a test

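I would like the result to be just Hello. That would let me determine which tag contains which content. I tried recursive=False, but it does not seem to work on a full web page that has multiple div tags with content.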

Added code snippet:

import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request("https://www.healthline.com/health/fitness-exercise/pushups-everyday", headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode("utf-8").lower()
soup = BeautifulSoup(html, 'html.parser')
divTag = soup.find_all('div')
text = []
for div in divTag:
    i = div.getText()  # returns the text of the div and of all its descendants
    text.append(i)
print(text)

Thanks.

Answer 1

Based on your description, see the answer here: how to get text from within a tag, but ignore other child tags

That leads to something like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div'):
    print(div.find(text=True, recursive=False))

Edit: you only need to change

i = div.getText()

to

i = div.find(text=True, recursive=False)
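One caveat: find(..., recursive=False) returns only the first direct text node of a div, and None when the div has no direct text at all. Below is a minimal sketch, assuming you want every direct text node of each div joined into one string. The helper name direct_text and the sample HTML are made up for illustration, and string= is the newer name of the text= argument used above.

```python
from bs4 import BeautifulSoup

def direct_text(tag):
    """Return the text directly inside `tag`, skipping text that lives in child tags."""
    # string=True matches text nodes; recursive=False limits the search to direct children.
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())

# Hypothetical sample: an outer div with two direct text nodes and a nested div.
html = "<div>Hello<p>this is a test</p>again<div>inner</div></div>"
soup = BeautifulSoup(html, "html.parser")

texts = [direct_text(div) for div in soup.find_all("div")]
print(texts)  # ['Hello again', 'inner']
```

A div that only wraps other tags yields an empty string here, so on a real page you may want to filter those entries out before using the list.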

Answer 2

Here is a possible solution, in which we extract all the 'p' tags from the soup.

from bs4 import BeautifulSoup
html = "<div>Hello<p>this is a test</p></div>"
soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p'):
    p.extract()
print(soup.text)
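Note that extract() removes the tags from the soup itself, so the p text is no longer available afterwards. If you still need it, one option is to strip the child tags from a copy of each div instead. This is only a sketch under that assumption; the variable names are illustrative.

```python
import copy
from bs4 import BeautifulSoup

html = "<div>Hello<p>this is a test</p></div>"
soup = BeautifulSoup(html, "html.parser")

own_text = []
for div in soup.find_all("div"):
    clone = copy.copy(div)  # detached copy; the original soup is left untouched
    for child in clone.find_all(True, recursive=False):
        child.extract()  # drop every direct child tag from the copy
    own_text.append(clone.get_text())

print(own_text)              # ['Hello']
print(soup.find("p").text)   # this is a test
```

The copy costs some extra memory per div, but it keeps the original tree usable for later lookups.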

BS4: get text from all div tags, but not from their child tags
