Python BeautifulSoup: extracting content from a p tag inside a div in an HTML file prints nothing

Time: 2016-08-12 14:31:06

Tags: python-2.7 beautifulsoup

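I am trying to extract some data from my Selenium Test Report HTML file, but nothing useful is printed to the PyCharm console. I want to get all the data from the p tags, which sit inside a div tag.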

The HTML snippet is:

<div class='heading'>
<h1>Test Report</h1>
<p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p>
<p class='attribute'><strong>Duration:</strong> 0:48:09.007000</p>
<p class='attribute'><strong>Status:</strong> Pass 75</p>

<p class='description'>Selenium - ClearCore 501 Regression edit project automated test</p>
</div>

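As a first step, I am trying to pull out the Start Time and print its value to the console, but nothing is printed. I also want to get the description, "Selenium - ClearCore 501 Regression edit project automated test".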

My code is:

from bs4 import BeautifulSoup

def extract_data_from_report_htmltestrunner():
    filename = (r"C:\share\ClearCore501_Automated_GUI_TestReport.html")
    html_report_part = open(filename,'r')
    soup = BeautifulSoup(html_report_part, "html.parser")
    div_heading = soup.find('div', {'class': 'heading'})
    p = div_heading.find('p', text='Start Time:')
    print "test"
    print p

I have also added:

if __name__ == "__main__":
    extract_data_from_report_htmltestrunner()

The output I get now is:

test
None

What am I doing wrong?

Thanks, Riaz

1 Answer:

Answer 0: (score: 2)

The text is inside the strong tag, not the p tag, so find the strong tag by its text and call .parent to get the p tag:

In [10]: html = """<div class='heading'>
   ....: <h1>Test Report</h1>
   ....: <p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p>
   ....: <p class='attribute'><strong>Duration:</strong> 0:48:09.007000</p>
   ....: <p class='attribute'><strong>Status:</strong> Pass 75</p>
   ....: 
   ....: <p class='description'>Selenium - ClearCore 501 Regression edit project automated test</p>
   ....: </div>"""

In [11]: from bs4 import BeautifulSoup

In [12]: soup = BeautifulSoup(html, "html.parser")

In [13]: div_heading = soup.find('div', {'class': 'heading'})

In [14]: p = div_heading.find('strong', text='Start Time:').parent

In [15]: print p
<p class="attribute"><strong>Start Time:</strong> 2016-08-12 11:57:33</p>
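To see why the original `find('p', text='Start Time:')` returned None, here is a minimal sketch. It assumes Python 3 and Beautiful Soup 4.4.0 or later, where the `text` filter is also available under the name `string`; on Python 2.7 as in the question, the same calls apply with `text=`. A text filter on a tag is compared against that tag's `.string`, and `.string` is None when a tag has more than one child, which is the case for each `p` here.

```python
from bs4 import BeautifulSoup

# Reduced copy of the report heading from the question.
html = """<div class='heading'>
<p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="attribute")

# The <p> has two children (a <strong> tag and a trailing text node),
# so its .string is None and a text filter on the <p> cannot match.
p_string = p.string
direct_match = soup.find("p", string="Start Time:")

# The <strong> tag has a single text child, so the filter matches there,
# and .parent walks back up to the enclosing <p>.
strong = soup.find("strong", string="Start Time:")
parent_p = strong.parent

print(p_string)       # None
print(direct_match)   # None
print(parent_p.name)  # p
```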

To get the description, use the class name:

In [16]: div_heading.find("p", class_="description")
Out[16]: <p class="description">Selenium - ClearCore 501 Regression edit project automated test</p>
In [17]: div_heading.find("p", class_="description").text
Out[17]: u'Selenium - ClearCore 501 Regression edit project automated test'

If you only want the date, call p.find(text=True, recursive=False) so that you don't get text from any child tags.

In [18]: p = div_heading.find('strong', text='Start Time:').parent

In [19]: p.find(text=True, recursive=False)
Out[19]: u' 2016-08-12 11:57:33'
In [20]: p.text
Out[20]: u'Start Time: 2016-08-12 11:57:33'

You can see the difference between the two approaches above. Calling .text on the strong tag alone just gives you u'Start Time:'

In [21]:  div_heading.find('strong', text='Start Time:').text
Out[21]: u'Start Time:'
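Since the question asks for all the data in the p tags, the pieces above can be combined into one loop. This is only a sketch under a few assumptions: Python 3, Beautiful Soup 4.4.0 or later (hence `string=` rather than `text=`), and a report that always follows the structure shown in the question. The `report` dictionary and its "Description" key are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = """<div class='heading'>
<h1>Test Report</h1>
<p class='attribute'><strong>Start Time:</strong> 2016-08-12 11:57:33</p>
<p class='attribute'><strong>Duration:</strong> 0:48:09.007000</p>
<p class='attribute'><strong>Status:</strong> Pass 75</p>

<p class='description'>Selenium - ClearCore 501 Regression edit project automated test</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div_heading = soup.find("div", class_="heading")

report = {}  # hypothetical container for the extracted values
for p in div_heading.find_all("p", class_="attribute"):
    # The label lives in the <strong> child; drop the trailing colon.
    key = p.find("strong").get_text().strip().rstrip(":")
    # Only text nodes that are direct children of <p>, so the label is skipped.
    value = "".join(p.find_all(string=True, recursive=False)).strip()
    report[key] = value

report["Description"] = div_heading.find("p", class_="description").get_text().strip()

for key, value in report.items():
    print(key, "=", value)
```

When reading from a file as in the question, the open file object can be passed to BeautifulSoup in place of the `html` string.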