Trying to scrape text from HTML that has no special tags other than br (Python 3)

Time: 2016-08-09 09:33:36

Tags: html python-3.x web-scraping beautifulsoup

I've been building a scraper for my company's website, but I've run into a problem: I need to scrape text from an HTML table, and I can't get at the data I need.

HTML CODE

    <div>
    <table class="style3" cellspacing="0" rules="all" border="1" id="ctl00_cpMainContent_gvNodes" style="border-color:White;border-style:None;width:1090px;border-collapse:collapse;">
        <tr>
            <th scope="col">History</th>
        </tr><tr>
            <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>
        </tr><tr>
            <td style="color:White;background-color:Blue;border-color:Black;border-style:Inset;font-size:12pt;font-weight:normal;">date updated: 02/01/2014 21:42:52   |   By: jakubkwasny   |   Status: Resolved</td>
        </tr><tr>
            <td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;"><br />Root Cause: Hardware Failure<br />Action Completed: Power supply/filter/cable swap<br /><br />Arrival Time: 02/01/2014 15:54:17<br />Leaving Time: 02/01/2014 16:27:44<br />Was the job successful: Yes<br /><br /><br />Notes:replaced dsl cable and filter. Also rebooted all equipment. All working fine now.<br />Next Action required:none<br />Added by jakubkwasny at 02/01/2014 21:41:40<br /><br />Pinging 99.99.99.99 with 32 bytes of data:<br />Reply from 99.99.99.99: bytes=32 time=67ms TTL=240<br />Reply from 99.999.999.99: bytes=32 time=92ms TTL=240<br />Reply from 99.99.65.65: bytes=32 time=76ms TTL=240<br />Reply from 67.45.32.12: bytes=32 time=82ms TTL=240<br /><br />Ping statistics for 12.12.12.12:<br />Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),<br />Approximate round trip times in milli-seconds:<br />Minimum = 67ms, Maximum = 92ms, Average = 79ms</td>
        </tr><tr>
            <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>

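I need to be able to scrape the data separated by the br tags, for example the data in the third td tag. I've managed to get all the data out of the table, but I can't figure out how to get a specific row and then get the text between its br tags.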

CODE SNIPPET

bsobjswap = BeautifulSoup(r2.content)
print (bsobjswap.find('table',{'id':'ctl00_cpMainContent_gvNodes'}).find("style",{"color":"Black"}))

This is my latest attempt, but it doesn't work. Any help is appreciated.

More data

<div id="ctl00_cpMainContent_upNodes">

    <div>
    <table class="style3" cellspacing="0" rules="all" border="1" id="ctl00_cpMainContent_gvNodes" style="border-color:White;border-style:None;width:1090px;border-collapse:collapse;">
        <tr>
            <th scope="col">History</th>
        </tr><tr>
            <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>
        </tr><tr>
            <td style="color:White;background-color:Blue;border-color:Black;border-style:Inset;font-size:12pt;font-weight:normal;">date updated: 02/01/2014 21:21:16   |   By: jakubkwasny   |   Status: Resolved</td>
        </tr><tr>
            <td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;"><br />Root Cause: Core / Authentication issue<br />Action Completed: No site visit required<br /><br />Hi Chris,<br /><br />There were no faults detected. As installation have been done recently, Lancom uses 2.05 configuration script. Our engineer was unable to see landing page, he was getting connected to the Internet with. I contacted Picopoint who informed me that this is due the fact that their system remembers MAC addresses of the devices that were logged into the system hence no landing page is needed. It have been confirmed by removing MAC addresses of the engineer's devices from the database. By doing so engineer was able to access the landing page again. Picopoint's engineer checked the configuration of the devices at both ends and haven't detected any problems. At the moment we are unable to state what are the issues with venue as we haven't experienced any. <br /><br />Arrival Time: 02/01/2014 16:19:23<br />Leaving Time: 02/01/2014 17:51:18<br />Was the job successful: Yes<br /><br /><br />Notes:Still physically missing lines 3 and 4. See screen shot.<br />Line 6 has a dial tone BUT no dsl is present on line.<br />Still getting some landing page errors.. 
My laptop now seems to work but my android phone justs connects to google with no landing page .<br /><br />Screen shots included but couldnt access youtube ( was recieveing an block ID error )<br />ASDA resriction ?<br /><br />Picopoint still looking into problem according to Jakub<br /><br />Next Action required:Ask Jakub<br />Added by jakubkwasny at 02/01/2014 21:10:12<br /><br />Pinging 11.11.11.11 with 32 bytes of data:<br />Reply from 11.11.11.11: bytes=32 time=47ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=38ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=39ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=41ms TTL=50<br /><br />Ping statistics for 11.11.11.11:<br />Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),<br />Approximate round trip times in milli-seconds:<br />Minimum = 38ms, Maximum = 47ms, Average = 41ms</td>
        </tr><tr>
            <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td>

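My code visits thousands of pages, and from looking at them, every table follows the same pattern. I'm guessing I will always need the data from the third td tag, but I don't know how to get it.

Cheers

1 Answer:

Answer 0: (score: 0)

How about this: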

from bs4 import BeautifulSoup

html = """(your html from the example above)"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first <td> whose style attribute matches this string exactly.
cell_style = 'color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;'
row_data = soup.find('td', {'style': cell_style})

# Strip the opening and closing <td> tags from the serialized cell.
clean_data = str(row_data).replace('<td style="' + cell_style + '">', '')\
    .replace('</td>', '')

# BeautifulSoup serializes <br /> as <br/>, so split on that and drop empty pieces.
print('\n'.join([x for x in clean_data.split('<br/>') if x != '']))

"""
Generated output:

Root Cause: Hardware Failure
Action Completed: Power supply/filter/cable swap
Arrival Time: 02/01/2014 15:54:17
Leaving Time: 02/01/2014 16:27:44
Was the job successful: Yes
Notes:replaced dsl cable and filter. Also rebooted all equipment. All working fine now.
Next Action required:none
Added by jakubkwasny at 02/01/2014 21:41:40
Pinging 99.99.99.99 with 32 bytes of data:
Reply from 99.99.99.99: bytes=32 time=67ms TTL=240
Reply from 99.999.999.99: bytes=32 time=92ms TTL=240
Reply from 99.99.65.65: bytes=32 time=76ms TTL=240
Reply from 67.45.32.12: bytes=32 time=82ms TTL=240
Ping statistics for 12.12.12.12:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 67ms, Maximum = 92ms, Average = 79ms
"""
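
A note on the attempt in the question: `find("style", {"color":"Black"})` searches for a `<style>` tag with a `color` attribute, not for a `td` whose `style` attribute mentions black, so it returns `None`. Since the question says the data always sits in the third `td`, a possible alternative is to pick the cell by position and let BeautifulSoup split the text at the `br` tags via `stripped_strings`. The sketch below is only an illustration: `SAMPLE_HTML` is a shortened stand-in for the real page, and `history_lines` and its `cell_index` parameter are hypothetical names, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the table shown in the question.
SAMPLE_HTML = """
<table id="ctl00_cpMainContent_gvNodes">
  <tr><th scope="col">History</th></tr>
  <tr><td style="color:White;"> </td></tr>
  <tr><td style="color:White;background-color:Blue;">date updated: 02/01/2014 21:42:52</td></tr>
  <tr><td style="color:Black;background-color:LightSkyBlue;"><br />Root Cause: Hardware Failure<br />Action Completed: Power supply/filter/cable swap<br /><br />Was the job successful: Yes</td></tr>
  <tr><td style="color:White;"> </td></tr>
</table>
"""


def history_lines(html, table_id="ctl00_cpMainContent_gvNodes", cell_index=2):
    """Return the non-empty text lines of one <td> in the history table.

    cell_index=2 is the third <td>, which is an assumption taken from
    the question's description of the page layout.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    cells = table.find_all("td")
    if len(cells) <= cell_index:
        return []
    # stripped_strings yields each text node between the <br> tags,
    # with surrounding whitespace removed and blank strings skipped.
    return list(cells[cell_index].stripped_strings)


lines = history_lines(SAMPLE_HTML)
print("\n".join(lines))
```

Selecting by position avoids hard-coding the long `style` string, but it breaks if the page ever adds or removes a cell before the one you want, so it is worth checking against a few real pages first.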