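I have been building a web scraper for a website, and I want to use .findall (or anything else that works) to extract all the node numbers from an HTML table. I am struggling to get it working and keep running into errors; I am apparently not using the right tag.
Can anyone help? The HTML is below: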
</div>
<table class="dataTable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgNodes" style="border-collapse:collapse;">
<tr class="header noBreak">
<td> </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl00','')">Node Name</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl01','')">Description</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl02','')">MAC Address</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl03','')"></a>
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$liNodeRoleHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_liNodeRoleHeader">Node Role</a>
</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl04','')">Firmware</a></td><td>
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$lbUptimeHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_lbUptimeHeader">Uptime</a>
</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl05','')">Users</a></td>
</tr><tr onmouseover="this.className = 'highlightedRow';" onmouseout="this.className = 'normalRow';" onclick="GoToNodePage('522');" style="height:18px;">
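I need to extract the number 522 from the last line of that HTML, along with all the other GoToNodePage numbers, but I can't figure out how. Any help is appreciated. I would also like to put the extracted numbers into a list for later use.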
r2 = s2.get(webpage)
bsobjswap = BeautifulSoup(r2.content)
listy = []
for link in bsobjswap.findall('tr'):
    if 'onclick' in link.attrs:
        listy.append(link)
print(listy)
The error is: for link in bsobjswap.findall('tr'): TypeError: 'NoneType' object is not callable
Answer 0 (score: -1)
Try something like this:
from bs4 import BeautifulSoup
xml = """<table class="dataTable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgNodes" style="border-collapse:collapse;">
<tr class="header noBreak">
<td> </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl00','')">Node Name</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl01','')">Description</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl02','')">MAC Address</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl03','')"></a>
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$liNodeRoleHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_liNodeRoleHeader">Node Role</a>
</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl04','')">Firmware</a></td><td>
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$lbUptimeHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_lbUptimeHeader">Uptime</a>
</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl05','')">Users</a></td>
</tr><tr onmouseover="this.className = 'highlightedRow';" onmouseout="this.className = 'normalRow';" onclick="GoToNodePage('522');" style="height:18px;">"""
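# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(xml, "html.parser")
# The method is find_all (findAll is a legacy alias); onclick=True keeps only rows that have the attribute.
print([i.get('onclick') for i in soup.find_all('tr', attrs={'onclick': True})])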
This returns ["GoToNodePage('522');"]
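For context on the original error: BeautifulSoup has no method named findall. Attribute access with a name it does not recognise is treated as a search for a child tag of that name, so soup.findall behaves like soup.find('findall'), which returns None here, and calling None raises the TypeError. A minimal sketch of that behaviour, using a made-up two-row table (the html string is an assumption for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the table in the question.
html = """<table>
<tr class="header"><td>Node Name</td></tr>
<tr onclick="GoToNodePage('522');"><td>a</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# There is no <findall> tag in the document, so this lookup yields None.
missing = soup.findall
print(missing)

# Calling None is what produces the TypeError from the question.
try:
    soup.findall('tr')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The correctly spelled method returns the rows.
rows = soup.find_all('tr')
print(len(rows))
```

So the tag name 'tr' in the question was fine; only the method name needed changing.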
From here you can extract the number with a regular expression, for example:
import re
print([re.findall(r"\d+", i.get('onclick')) for i in soup.find_all('tr', attrs={'onclick': True})])
This returns [['522']]
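Since the question asks for a plain list to use later, the nested result can be flattened. A minimal sketch, assuming every relevant onclick value has the form GoToNodePage('<digits>'); the helper name extract_node_ids and the sample strings are made up for illustration:

```python
import re

# Captures the numeric argument of GoToNodePage('...'); assumes single quotes.
NODE_ID = re.compile(r"GoToNodePage\('(\d+)'\)")

def extract_node_ids(onclick_values):
    """Return the node numbers found in an iterable of onclick strings."""
    node_ids = []
    for value in onclick_values:
        match = NODE_ID.search(value)
        if match:  # skip rows whose onclick calls something else
            node_ids.append(int(match.group(1)))
    return node_ids

# Hypothetical sample values standing in for the attributes read from the page.
sample = ["GoToNodePage('522');", "GoToNodePage('17');", "alert('x');"]
print(extract_node_ids(sample))  # [522, 17]
```

With the soup object from the code above, the input could be (tr['onclick'] for tr in soup.find_all('tr', onclick=True)). Matching on the function name rather than on any run of digits avoids picking up unrelated numbers if a row's onclick ever contains them.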