Trying to scrape data from HTML code using Python

Time: 2016-08-01 12:04:10

Tags: python html web-scraping beautifulsoup

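I have been building a web scraper for a website, and I want to use .findall, or anything else that works, to extract all the node numbers from an HTML table. I am struggling to get it working and keep hitting errors; I am clearly not passing the right tags.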

Can anyone help? The HTML code is below.

</div>

<table class="dataTable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgNodes" style="border-collapse:collapse;">
    <tr class="header noBreak">
        <td>&nbsp;</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl00','')">Node Name</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl01','')">Description</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl02','')">MAC Address</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl03','')"></a>
                <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$liNodeRoleHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_liNodeRoleHeader">Node Role</a>
            </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl04','')">Firmware</a></td><td>
                <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$lbUptimeHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_lbUptimeHeader">Uptime</a>
            </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl05','')">Users</a></td>
    </tr><tr onmouseover="this.className = 'highlightedRow';" onmouseout="this.className = 'normalRow';" onclick="GoToNodePage('522');" style="height:18px;">

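I need to extract the number 522 from the last line of the code above, along with all the other GoToNodePage numbers, but I can't figure out how. Any help is appreciated. I would also like to put the extracted numbers into a list for later use.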

r2 = s2.get(webpage)
bsobjswap = BeautifulSoup(r2.content)

listy = []
for link in bsobjswap.findall('tr'):
    if 'onclick' in link.attrs:
        listy.append(link)
print (listy)

The error is: for link in bsobjswap.findall('tr'): TypeError: 'NoneType' object is not callable

1 Answer:

Answer 0: (score: -1)

Try something like this:

import re  # needed for the regular-expression step further down

from bs4 import BeautifulSoup

xml = """<table class="dataTable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgNodes" style="border-collapse:collapse;">
    <tr class="header noBreak">
        <td>&nbsp;</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl00','')">Node Name</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl01','')">Description</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl02','')">MAC Address</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl03','')"></a>
                <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$liNodeRoleHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_liNodeRoleHeader">Node Role</a>
            </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl04','')">Firmware</a></td><td>
                <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$lbUptimeHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_lbUptimeHeader">Uptime</a>
            </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl05','')">Users</a></td>
    </tr><tr onmouseover="this.className = 'highlightedRow';" onmouseout="this.className = 'normalRow';" onclick="GoToNodePage('522');" style="height:18px;">"""

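# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(xml, "html.parser")
# Keep only the <tr> rows that carry an onclick attribute and print its value.
print([i.get('onclick') for i in soup.find_all('tr', attrs={'onclick': True})])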

This returns ["GoToNodePage('522');"]
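
As for the TypeError in the question: as far as I can tell it comes from the method name. BeautifulSoup has no findall; the method is find_all (findAll is the older alias). An unknown attribute name is treated as a tag lookup, so bsobjswap.findall behaves like bsobjswap.find('findall'), which gives None, and calling None raises "'NoneType' object is not callable". A minimal sketch of that behaviour follows; the one-row table is a made-up stand-in for the real page, not the actual markup.

```python
from bs4 import BeautifulSoup

# A one-row stand-in for the table in the question (made up for illustration).
html = """<table><tr onclick="GoToNodePage('522');"><td>node</td></tr></table>"""
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute names are treated as tag names,
# so this is equivalent to soup.find("findall").
missing = soup.findall
print(missing)  # None, because the document has no <findall> tag

# The real method is find_all (findAll is the older alias for it).
rows = soup.find_all("tr")
print(len(rows))
```

Changing findall to find_all in the loop from the question should be enough to get past the TypeError.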

From here you can extract the number with a regular expression, for example:

print([re.findall(r"\d+", i.get('onclick')) for i in soup.find_all('tr', attrs={'onclick': True})])

This returns [['522']]
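
Since the question asks for the numbers in a plain list for later use, here is a rough sketch of one way to flatten the result. The helper name extract_node_ids and the pattern are my own assumptions, based only on the single onclick value shown above, and the second sample value ('17') is invented for the example. It works on the list of onclick strings produced earlier and uses only the standard library.

```python
import re

# Matches the argument of GoToNodePage('...'). The pattern is an assumption
# based on the single onclick value shown in the question.
NODE_ID_PATTERN = re.compile(r"GoToNodePage\('(\d+)'\)")


def extract_node_ids(onclick_values):
    """Return the node numbers found in a list of onclick strings, in order."""
    node_ids = []
    for value in onclick_values:
        match = NODE_ID_PATTERN.search(value)
        if match:  # skip rows whose onclick does something else
            node_ids.append(match.group(1))
    return node_ids


# The first value comes from the question; the second is a made-up sample.
onclick_values = ["GoToNodePage('522');", "GoToNodePage('17');"]
print(extract_node_ids(onclick_values))  # ['522', '17']
```

Anchoring the pattern on GoToNodePage rather than a bare \d+ avoids picking up unrelated digits if another row's onclick happens to contain a number. The values stay as strings; wrap match.group(1) in int() if you need integers.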