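I'm new to Python and I'm trying to scrape a website. I can log in to the site and fetch an HTML page, but I don't need the whole page; I only need the hyperlinks inside one specific table.
I wrote the following code, but it returns every hyperlink on the page.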
soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'} ):
    for link in soup.findAll('a'):
        print link.get('href')
Can anyone help me see where I'm going wrong?
Here is the HTML for the table:
<table id="ctl00_Main_lvMyAccount_Table1" width="680px">
  <tr id="ctl00_Main_lvMyAccount_Tr1">
    <td id="ctl00_Main_lvMyAccount_Td1">
      <table id="ctl00_Main_lvMyAccount_itemPlaceholderContainer" border="1" cellspacing="0" cellpadding="3">
        <tr id="ctl00_Main_lvMyAccount_Tr2" style="background-color:#0090dd;">
          <th id="ctl00_Main_lvMyAccount_Th1"></th>
          <th id="ctl00_Main_lvMyAccount_Th2">
            <a id="ctl00_Main_lvMyAccount_SortByAcctNum" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')">
              <font color=white>
                <span id="ctl00_Main_lvMyAccount_AcctNum">Account number</span>
              </font>
            </a>
          </th>
          <th id="ctl00_Main_lvMyAccount_Th4">
            <a id="ctl00_Main_lvMyAccount_SortByServAdd" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')">
              <font color=white>
                <span id="ctl00_Main_lvMyAccount_ServiceAddress">Service address</span>
              </font>
            </a>
          </th>
          <th id="ctl00_Main_lvMyAccount_Th5">
            <a id="ctl00_Main_lvMyAccount_SortByAcctName" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')">
              <font color=white>
                <span id="ctl00_Main_lvMyAccount_AcctName">Name</span>
              </font>
            </a>
          </th>
          <th id="ctl00_Main_lvMyAccount_Th6">
            <a id="ctl00_Main_lvMyAccount_SortByStatus" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')">
              <font color=white>
                <span id="ctl00_Main_lvMyAccount_AcctStatus">Account status</span>
              </font>
            </a>
          </th>
          <th id="ctl00_Main_lvMyAccount_Th3"></th>
        </tr>
        <tr>
          <td>
Thanks in advance.
Answer 0 (score: 1)
Well, this is the right way to do it.
soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'} ):
    for link in table.findAll('a'): #search for links only in the table
        print link['href'] #get the href attribute
Also, you can skip the outer loop, since only one element matches the specified ID:
soup = BeautifulSoup(the_page)
table = soup.find('table',{'id':'ctl00_Main_lvMyAccount_Table1'})
for link in table.findAll('a'): #search for links only in the table
    print link['href'] #get the href attribute
Update: Noted what @DSM said. Fixed the missing quote in the table assignment.
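As a hedged illustration of the single-lookup version above, here is a minimal Python 3 sketch using bs4. The `SAMPLE_PAGE` markup, its URLs, and the helper name `links_in_table` are hypothetical and not from the question; only the table id is. It also guards against `find()` returning `None`, which happens when no table has that id.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the_page; only the table id comes from the question.
SAMPLE_PAGE = """
<a href="/home">Home</a>
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><td><a href="/account/1">Account 1</a></td></tr>
  <tr><td><a href="/account/2">Account 2</a></td></tr>
</table>
<a href="/logout">Log out</a>
"""


def links_in_table(html, table_id):
    """Return the href of every <a> inside the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # find() returns None when nothing matches, so guard before searching.
        return []
    # href=True skips anchors that have no href attribute.
    return [a["href"] for a in table.find_all("a", href=True)]


print(links_in_table(SAMPLE_PAGE, "ctl00_Main_lvMyAccount_Table1"))
```

With this sample markup the links outside the table are not returned, and an unknown id yields an empty list instead of an AttributeError.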
Answer 1 (score: 0)
Make sure your for loop searches the table's HTML (rather than the soup
variable, which holds the whole page's HTML):
from bs4 import BeautifulSoup
page = BeautifulSoup(the_page)
table = page.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
links = table.findAll('a')
# Print the href of each link found in the table
for link in links:
    print link['href']
In [8]: table = page.find('table', {'id' : 'ctl00_Main_lvMyAccount_Table1'})
In [9]: links = table.findAll('a')
In [10]: for link in links:
   ....:     print link['href']
   ....:
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')
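The hrefs printed above are all `javascript:` postback links from the sort headers. If the rows further down the table hold ordinary URLs (an assumption, since that part of the HTML is cut off in the question), the two kinds could be separated roughly as follows. This is only a Python 3 sketch: the sample markup and the `/account/42` URL are made up, and `select()` with a CSS selector is used here as an alternative to the `find` plus `findAll` pair.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one sort-header postback link and one ordinary link.
SAMPLE_PAGE = """
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><th><a href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')">Account status</a></th></tr>
  <tr><td><a href="/account/42">42</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")
# "#id a[href]" matches every <a> with an href nested anywhere under that id.
hrefs = [a["href"] for a in soup.select("#ctl00_Main_lvMyAccount_Table1 a[href]")]

# Split the postback pseudo-links from links that point at real pages.
postbacks = [h for h in hrefs if h.startswith("javascript:")]
page_links = [h for h in hrefs if not h.startswith("javascript:")]

print(page_links)
```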
Answer 2 (score: 0)
Your nested loop for link in soup.findAll('a'):
is searching the entire HTML page.
To search only for links inside the table, change that line to:
for link in table.findAll('a'):
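To make the difference concrete, the following Python 3 sketch runs both searches against the same document. The sample page and its URLs are hypothetical; only the table id is taken from the question. `find_all` is the bs4 spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two links outside the table, one inside it.
SAMPLE_PAGE = """
<a href="/home">Home</a>
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><td><a href="/account/1">Account 1</a></td></tr>
</table>
<a href="/logout">Log out</a>
"""

soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")
table = soup.find("table", id="ctl00_Main_lvMyAccount_Table1")

# Searching from soup walks the whole document.
whole_page = [a["href"] for a in soup.find_all("a")]
# Searching from table only walks that table's descendants.
table_only = [a["href"] for a in table.find_all("a")]

print(whole_page)
print(table_only)
```

The first list contains all three links, while the second contains only the one nested in the table, which is why the loop in the question printed every link on the page.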