Question

Using Beautiful Soup 4, I'm trying to print the contents of the h1 elements without the tags.

I'm using Python 3.6 and Beautiful Soup 4.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    html = urlopen('https:/place_holder.com/')
    bs = BeautifulSoup(html.read(), 'html.parser')
    headings = bs.find_all('h1')
    print(headings)

Expected result:

First Heading Second Heading Third Heading

Actual result: each heading is printed with its opening and closing h1 tags.

Answer 1

Here is a hacky solution:

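    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    html = urlopen('https://place_holder.com/')
    bs = BeautifulSoup(html.read(), 'html.parser')
    headings = bs.find_all('h1')

    # New: find_all() returns a list of Tag objects, which has no replace(),
    # so convert each Tag to a string before stripping the literal tags.
    # This only matches a bare <h1>, not one that carries attributes.
    headings = [str(h).replace('<h1>', '').replace('</h1>', '') for h in headings]

    print(headings)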
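Literal string replacement misses headings such as `<h1 class="title">`. If you want to stay with the string-stripping approach, a minimal sketch using the standard-library `re` module might look like the following. The sample markup and the helper name `strip_h1` are made up for illustration, and a regex is still only a stopgap compared with a real HTML parser.

```python
import re

# Hypothetical sample markup; in the question this would be str(heading)
SAMPLE = '<h1 class="title">First Heading</h1>'


def strip_h1(markup):
    """Remove opening and closing h1 tags, keeping whatever is between them."""
    # Matches <h1>, <h1 ...attributes...> and </h1>, ignoring case
    return re.sub(r'</?h1\b[^>]*>', '', markup, flags=re.IGNORECASE)


print(strip_h1(SAMPLE))
```

Any other tags nested inside the heading are left untouched by this pattern.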

On an unrelated note:

You want https://place_holder.com/

not https:/place_holder.com/

Answer 2

The key method you're looking for is Tag.get_text().

For example:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen('http://example.com/')
    bs = BeautifulSoup(html.read(), 'html.parser')
    headings = bs.find_all('h1')
    for h in headings:
        print(h.get_text())  # This will print the text between the tags
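To get the single-line output from the question, the heading texts can be collected and joined. Here is a small sketch that parses an inline HTML string instead of fetching a page, so it runs without network access. The sample markup and the helper name `heading_texts` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page
SAMPLE_HTML = """
<html><body>
<h1>First Heading</h1>
<h1 class="title">Second <em>Heading</em></h1>
<h1>  Third Heading  </h1>
</body></html>
"""


def heading_texts(markup):
    """Return the text of every h1 in markup, without any tags."""
    soup = BeautifulSoup(markup, 'html.parser')
    # The ' ' separator joins text fragments split by nested tags;
    # strip=True trims whitespace around each fragment.
    return [h.get_text(' ', strip=True) for h in soup.find_all('h1')]


print(' '.join(heading_texts(SAMPLE_HTML)))
# prints: First Heading Second Heading Third Heading
```

Passing a separator matters once a heading contains nested tags: with strip=True and no separator, the fragments would be concatenated with nothing between them.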

Print h1 headings without tags
