Using BeautifulSoup

Time: 2016-05-02 02:41:21

Tags: python html beautifulsoup

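I'm just learning Python, and I've spent hours trying to solve this problem. Basically, I have an HTML document with a repeating structure, and I'm trying to extract certain elements from each repetition. I figured out how to pull out the first element, but I can't for the life of me figure out how to pull any of the others. The first one was easy because it has a unique class, but the rest don't. Please help before I go crazy.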

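Below is the repeating portion of the HTML. I want to pull out the first heading, which I've been able to do. I'd also like to get the "Synopsis" and "Risk Factor" values.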

<h2 xmlns="" class="classsection4" id="idp201558400">50044 (1) - Ubuntu 
6.06 LTS / 8.04 LTS / 9.04 / 9.10 / 10.04 LTS / 10.10 : linux, 
linux-ec2, linux-source-2.6.15 vulnerabilities (USN-1000-1)</h2>
<h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">  
			<![endif]]]-->Synopsis</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">The remote Ubuntu host is missing one or more security-related patches.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">  
			<![endif]]]-->Description</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">This is some description text. 
(CVE-2010-NNN2).</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">  
			<![endif]]]-->Solution</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Update the affected packages.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">  
			<![endif]]]-->Risk Factor</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Critical</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">  
			<![endif]]]-->CVSS Base Score</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">10.0 (CVSS2#AV:N/AC:L/Au:N/C:C/I:C/A:C)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">  
			<![endif]]]-->CVSS Temporal Score</h2>
<span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">8.7 (CVSS2#E:ND/RL:OF/RC:ND)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;">

Here is my current code:

import requests
from bs4 import BeautifulSoup
import urllib
import re

page = open("C:/Users/AlphaWP/Downloads/631_SupportingFiles4_Labs6-7/Nessus Vulnerability Scan.htm").read()

soup = BeautifulSoup(page, "html.parser")

for section in soup.findAll("h2",{"class":"classsection4"}):
    # nextNode = section
    # print(nextNode.name)
    # print(section)
    print(section.contents)
    print("##############################")
    # print(section.contents)
    for section1 in soup.findAll('h2', text=re.compile(r'Risk')):
        print(section1)
        riskFactor = section1.find("span")
        riskLevel = riskFactor.contents
        print(riskLevel)
    print("##############################")

1 answer:

Answer 0 (score: 0)

To get all of the span elements:

spans = soup.find_all('span', {'class': 'classtext'})

spans is now a list of all span elements with the class classtext. To access the Synopsis span and the Risk Factor span:

>>> spans[0]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">The remote Ubuntu host is missing one or more security-related patches.</span>
>>> spans[3]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">Critical</span>
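
Indexes 0 and 3 only cover the first repetition. As a minimal sketch of one way to collect the values for every repetition, the code below makes a single pass over all h2 elements, starts a new record at each classsection4 heading, and for the labels of interest reads the next span with class classtext. In the sample HTML each span follows its h2 rather than sitting inside it, which is why section1.find("span") in the question cannot find it. The function name extract_findings, the wanted parameter, and the shortened sample_html are illustrative assumptions, not part of the original post.

```python
from bs4 import BeautifulSoup


def extract_findings(html, wanted=("Synopsis", "Risk Factor")):
    """Return one dict per classsection4 heading (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    findings = []
    current = None
    for heading in soup.find_all("h2"):
        # get_text() skips the conditional comment inside each h2;
        # split/join collapses line breaks inside long titles.
        label = " ".join(heading.get_text().split())
        if "classsection4" in heading.get("class", []):
            # A new repetition starts here.
            current = {"title": label}
            findings.append(current)
        elif current is not None and label in wanted:
            # The value is the next classtext span in document order.
            # Assumption: every label heading is followed by its own span.
            value = heading.find_next("span", class_="classtext")
            if value is not None:
                current[label] = value.get_text(strip=True)
    return findings


# Shortened stand-in for the report; the real file has more fields per block.
sample_html = """
<h2 class="classsection4" id="a">50044 (1) - Ubuntu vulnerabilities</h2>
<h2 class="classh1 "><!--[if mso]><img src="cid:#"><![endif]]]-->Synopsis</h2>
<span class="classtext">The remote Ubuntu host is missing patches.</span>
<h2 class="classh1 "><!--[if mso]><img src="cid:#"><![endif]]]-->Risk Factor</h2>
<span class="classtext">Critical</span>
"""

for finding in extract_findings(sample_html):
    print(finding)
```

Matching on the heading text instead of a list position means the result does not change if a block gains or loses a field. The label comparison is exact, so the strings in wanted need to match the headings as they appear in the report.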