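I'm just learning Python, and I've spent hours trying to solve this. Basically, I have an HTML document with a repeating structure, and I'm trying to extract certain elements from each repetition. I figured out how to pull out the first element, but I can't for the life of me figure out how to pull out any of the others. The first one was easy because it has a unique class, but the rest don't. Please help before I go crazy.
Here is the repeating part of the HTML. I want to pull out the first heading, which I was able to do. I'd also like to get the "Synopsis" and the "Risk Factor".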
<h2 xmlns="" class="classsection4" id="idp201558400">50044 (1) - Ubuntu
6.06 LTS / 8.04 LTS / 9.04 / 9.10 / 10.04 LTS / 10.10 : linux,
linux-ec2, linux-source-2.6.15 vulnerabilities (USN-1000-1)</h2>
<h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Synopsis</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">The remote Ubuntu host is missing one or more security-related patches.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Description</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">This is some description text.
(CVE-2010-NNN2).</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Solution</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Update the affected packages.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->Risk Factor</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Critical</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->CVSS Base Score</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">10.0 (CVSS2#AV:N/AC:L/Au:N/C:C/I:C/A:C)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
<![endif]]]-->CVSS Temporal Score</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">8.7 (CVSS2#E:ND/RL:OF/RC:ND)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">
Here is my code so far:
import requests
from bs4 import BeautifulSoup
import urllib
import re
page = open("C:/Users/AlphaWP/Downloads/631_SupportingFiles4_Labs6-7/Nessus Vulnerability Scan.htm").read()
soup = BeautifulSoup(page, "html.parser")
for section in soup.findAll("h2",{"class":"classsection4"}):
    # nextNode = section
    # print(nextNode.name)
    # print(section)
    print(section.contents)
    print("##############################")
    # print(section.contents)
for section1 in soup.findAll('h2', text=re.compile(r'Risk')):
    print(section1)
    riskFactor = section1.find("span")
    riskLevel = riskFactor.contents
    print(riskLevel)
    print("##############################")
Answer 0 (score: 0)
To get all of the span elements:
spans = soup.find_all('span', {'class': 'classtext'})
spans is now a list of all span elements with the class classtext. To access the Synopsis span and the Risk Factor span of the first repetition in the sample:
>>> spans[0]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">The remote Ubuntu host is missing one or more security-related patches.</span>
>>> spans[3]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">Critical</span>
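Fixed indices like 0 and 3 only cover the first repetition, so here is a minimal sketch of one way to collect the values for every repetition. It assumes, as in the sample, that each label h2 (class classh1) is followed by its value span as a sibling. That is also why section1.find("span") in the question returns None: find searches inside the h2, and the span is not a child of it. As far as I can tell, the text=re.compile(...) filter also matches nothing here, because each label h2 contains a comment plus text, so its .string is None; comparing get_text() avoids that. The function name extract_findings, the shortened sample HTML, and the second finding in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question. The second finding is
# hypothetical and only there to show that repetitions are handled.
SAMPLE_HTML = """
<h2 xmlns="" class="classsection4" id="idp201558400">50044 (1) - Ubuntu
linux vulnerabilities (USN-1000-1)</h2>
<h2 xmlns="" class="classh1 "><!--[if mso]><img src="cid:#">
<![endif]]]-->Synopsis</h2>
<span xmlns="" class="classtext">The remote Ubuntu host is missing one or more security-related patches.</span><h2 xmlns="" class="classh1 "><!--[if mso]><img src="cid:#">
<![endif]]]-->Solution</h2>
<span xmlns="" class="classtext">Update the affected packages.</span><h2 xmlns="" class="classh1 "><!--[if mso]><img src="cid:#">
<![endif]]]-->Risk Factor</h2>
<span xmlns="" class="classtext">Critical</span>
<h2 xmlns="" class="classsection4" id="example2">00000 (1) - Hypothetical second finding</h2>
<h2 xmlns="" class="classh1 "><!--[if mso]><img src="cid:#">
<![endif]]]-->Synopsis</h2>
<span xmlns="" class="classtext">Hypothetical synopsis text.</span><h2 xmlns="" class="classh1 "><!--[if mso]><img src="cid:#">
<![endif]]]-->Risk Factor</h2>
<span xmlns="" class="classtext">Low</span>
"""


def extract_findings(html, wanted=("Synopsis", "Risk Factor")):
    """Return one dict per classsection4 heading with the wanted fields."""
    soup = BeautifulSoup(html, "html.parser")
    findings = []
    current = None
    # find_all returns tags in document order, so every label h2 belongs
    # to the most recent classsection4 heading seen before it.
    for h2 in soup.find_all("h2"):
        classes = h2.get("class", [])
        if "classsection4" in classes:
            # Collapse the line breaks inside the title into single spaces.
            current = {"title": " ".join(h2.get_text().split())}
            findings.append(current)
        elif "classh1" in classes and current is not None:
            # get_text() skips the conditional comment inside the heading.
            label = h2.get_text(strip=True)
            if label in wanted:
                # The value is a sibling of the heading, not a child.
                span = h2.find_next_sibling("span", class_="classtext")
                if span is not None:
                    current[label] = span.get_text(strip=True)
    return findings


if __name__ == "__main__":
    for finding in extract_findings(SAMPLE_HTML):
        print(finding["title"])
        print("  Synopsis:", finding.get("Synopsis"))
        print("  Risk Factor:", finding.get("Risk Factor"))
```

For the real report, read the .htm file into a string and pass it to extract_findings in place of SAMPLE_HTML. If a section ever lacks its value span, find_next_sibling would pick up the span of a later section, so check for that if your file is less regular than the sample.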