Question

I want to get the full text content of each tag. For example, given this content:

html_code = """
<body>
    <h1>hello<b>there</b>how are you?</h1>
</body>"""

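I want to get this result:

For the body tag: '' (empty: none of the children's text)
For the h1 tag: 'hello there how are you?' (including all children)
For the b tag: 'there' (including all children)

I have tried many things, but none of them produced this result. Any suggestions?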

Answer 1

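Use .find() or .find_all() and control whether text inside child tags is included with the recursive parameter: recursive=False looks only at the tag's direct children, while recursive=True (the default) descends into all nested tags: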

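import bs4

html_code = """
<body>
    <h1>hello<b>there</b>how are you?</h1>
</body>"""

soup = bs4.BeautifulSoup(html_code, 'html.parser')

# string=True matches text nodes only (older bs4 releases name this argument text).
# recursive=False inspects direct children only; recursive=True descends into child tags.
body_text = soup.body.find_all(string=True, recursive=False)
h1_text = soup.h1.find_all(string=True, recursive=True)
b_text = soup.b.find_all(string=True, recursive=False)

# Join the text pieces with spaces and trim the surrounding whitespace.
body_text = ' '.join(body_text).strip()
h1_text = ' '.join(h1_text).strip()
b_text = ' '.join(b_text).strip()

print('body: %s\nh1: %s\nb: %s' % (body_text, h1_text, b_text))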

Output:

body: 
h1: hello there how are you?
b: there
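
The same idea can be applied to every tag without naming each one. The following is a minimal sketch, not part of the original answer: the helper names own_text and full_text are hypothetical, and it assumes a bs4 release that accepts the string argument. It collects, for each tag, the text held directly by the tag and the text including all descendants.

```python
from bs4 import BeautifulSoup

html_code = """
<body>
    <h1>hello<b>there</b>how are you?</h1>
</body>"""


def own_text(tag):
    """Text held directly by `tag`, ignoring text inside child tags (hypothetical helper)."""
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    # Drop whitespace-only pieces such as the newlines between tags.
    return ' '.join(p for p in pieces if p)


def full_text(tag):
    """Text of `tag` and all of its descendants, joined with spaces (hypothetical helper)."""
    return tag.get_text(separator=' ', strip=True)


soup = BeautifulSoup(html_code, 'html.parser')

# find_all(True) matches every tag in the document, in document order.
texts = {tag.name: (own_text(tag), full_text(tag)) for tag in soup.find_all(True)}

for name, (own, full) in texts.items():
    print('%s: own=%r full=%r' % (name, own, full))
```

Keying the dictionary by tag.name is only suitable for this small example, where each tag name occurs once; a document with repeated tag names would need a list of (tag, text) pairs instead. Note that for h1 the "own" text is 'hello how are you?', because the text inside b belongs to the child tag.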
