How to access the text inside an HTML <p> tag using Python BS4

Time: 2017-12-20 21:15:52

Tags: python beautifulsoup

This is my HTML (posted as an image, which is not reproduced here).

How can I access the text "A CTS PAC is nearing its expiration date." specifically, using the "find_all" command in BeautifulSoup?

2 answers:

Answer 0: (score: 0)

from bs4 import BeautifulSoup

html = '''
<p>
    <strong>123</strong>
    A CTS PAC is nearing its expiration date
</p>
'''

soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')

text = list(p.children)[-1]

print(text.strip())

If you have more than one <p> tag:

from bs4 import BeautifulSoup

html = '''
<p>
    Other p-tag
</p>
<p>
    <strong>123</strong>
    A CTS PAC is nearing its expiration date
</p>
'''

soup = BeautifulSoup(html, 'html.parser')

all_p = soup.find_all('p')


text = list(all_p[1].children)[-1]

print(text.strip())
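Picking the paragraph by a fixed index (all_p[1]) and its last child only works while the page layout stays the same. A minimal sketch of a less position-dependent variant follows, assuming the phrase you want is known in advance; the names target, matches and direct_text are illustrative and not part of the answer above.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<p>
    Other p-tag
</p>
<p>
    <strong>123</strong>
    A CTS PAC is nearing its expiration date
</p>
'''

soup = BeautifulSoup(html, 'html.parser')

# Assumption: the phrase we are looking for is known beforehand.
target = 'A CTS PAC is nearing its expiration date'

# Keep only the <p> tags whose combined text contains the phrase.
matches = [p for p in soup.find_all('p') if target in p.get_text()]

# Collect the text nodes that are direct children of the first match,
# skipping nested tags such as <strong> and whitespace-only strings.
direct_text = [
    child.strip()
    for child in matches[0].children
    if isinstance(child, NavigableString) and child.strip()
]

print(direct_text)
```

Filtering on the paragraph text first means the code does not need to know how many <p> tags precede the one of interest.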

Answer 1: (score: 0)

I believe this HTML comes from the cisco.com site. If so, here is a direct answer to your question.

>>> url = 'https://www.cisco.com/c/en/us/td/docs/security/asa/syslog/b_syslog/syslogs10.html'
>>> import bs4
>>> import requests
>>> page = requests.get(url).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')

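First, I tried searching for the plain string, which found nothing. On closer inspection of the page, I noticed that the string is followed by some trailing whitespace, so an exact match fails.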

>>> near = soup.find_all(string='A CTS PAC is nearing its expiration date')
>>> near
[]

A regular expression can find the string in the page source despite the trailing whitespace.

>>> import re
>>> near = soup.find_all(string=re.compile('A CTS PAC is nearing its expiration date'))
>>> near
['A CTS PAC is nearing its expiration date.\n\t ']
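The matched string still carries its trailing whitespace. The sketch below shows one way to tidy the result and get back to the enclosing tag. It runs against a small hypothetical fragment that imitates the whitespace seen above, not the live Cisco page, and the names pattern, cleaned and parents are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the trailing "\n\t " seen in the output above.
html = '<p><strong>123</strong>A CTS PAC is nearing its expiration date.\n\t </p>'
soup = BeautifulSoup(html, 'html.parser')

# A compiled pattern matches anywhere inside a text node,
# so extra characters after the phrase do not prevent a match.
pattern = re.compile(r'A CTS PAC is nearing its expiration date')
near = soup.find_all(string=pattern)

# Strip the surrounding whitespace from each matched text node.
cleaned = [s.strip() for s in near]

# Each match is a text node; .parent leads back to the tag that contains it.
parents = [s.parent.name for s in near]

print(cleaned)
print(parents)
```

Passing a plain string to string= requires the whole text node to be equal to it, which is why the first attempt returned an empty list, while a compiled pattern only needs to be found somewhere in the node.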