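I'm writing a simple Python web scraper. I want to extract some specific information from a website: freedns.afraid.org has a table of public and private domains, and I want to scrape only the public ones.
To give you a better idea, here is an excerpt from the page: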
<center>
<table border=0 width=90%>
<form action=/domain/registry/>
<input type=hidden name=sort value=5>
<tr>
<td bgcolor=cornflowerblue align=right colspan=4>
<table width=100% cellpadding=0 cellspacing=0>
<tr>
<td align=center><font face="verdana, Helvetica, Arial" size="2" color="white">Showing <b>201</b>-<b>300</b> of <b>89,317</b> total</font>
</td>
<td align=right>
<input type=text name=q value="">
<input style="background:orange;color:white;" type=submit name=submit value=SEARCH>
</td>
</tr>
</table>
</td>
</tr>
</form>
<tr>
<td bgcolor=cornflowerblue><a href="?sort=1&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Domain</font></a>
</td>
<td bgcolor=cornflowerblue><a href="?sort=2&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Status</font></a>
</td>
<td bgcolor=cornflowerblue><a href="?sort=3&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Owner</font></a>
</td>
<td bgcolor=cornflowerblue><a href="?sort=4&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Age</font></a>
</td>
</tr>
<tr>
<td colspan=4 bgcolor=#cccccc><font face="verdana, Helvetica, Arial" size="2" color="black">Sorted by: <b><u>Popularity</u></b></font>
</td>
</tr>
<tr class="trl">
<td><a href=/subdomain/edit.php?edit_domain_id=657177>arno.fi</a>
<br><span> (315 hosts in use) <a target=_blank rel="nofollow" href=http://www.arno.fi/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=661291&subject=arno.fi>kajarno</a>
</td>
<td>1656 days ago (03/19/2011)</td>
</tr>
<tr class="trd">
<td><a href=/subdomain/edit.php?edit_domain_id=697217>orenznakomstva.ru</a>
<br><span> (314 hosts in use) <a target=_blank rel="nofollow" href=http://www.orenznakomstva.ru/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=642020&subject=orenznakomstva.ru>igor67</a>
</td>
<td>1500 days ago (07/25/2011)</td>
</tr>
<tr class="trl">
<td><a href=/subdomain/edit.php?edit_domain_id=82359>4040.idv.tw</a>
<br><span> (313 hosts in use) <a target=_blank rel="nofollow" href=http://www.4040.idv.tw/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=163078&subject=4040.idv.tw>paulliu</a>
</td>
<td>3605 days ago (10/31/2005)</td>
</tr>
<tr class="trd">
<td><a href=/subdomain/edit.php?edit_domain_id=564813>remoteaccess.me</a>
<br><span> (313 hosts in use) <a target=_blank rel="nofollow" href=http://www.remoteaccess.me/>website</a></span>
</td>
<td>private</td>
<td><a href=/tools/contact.php?user_id=470899&subject=remoteaccess.me>theosophia</a>
</td>
<td>1790 days ago (11/07/2010)</td>
</tr>
<tr class="trl">
<td><a href=/subdomain/edit.php?edit_domain_id=710791>stes.fi</a>
<br><span> (311 hosts in use) <a target=_blank rel="nofollow" href=http://www.stes.fi/>website</a></span>
</td>
<td>private</td>
<td><a href=/tools/contact.php?user_id=794524&subject=stes.fi>jkortela</a>
</td>
<td>1496 days ago (08/25/2011)</td>
</tr>
<tr class="trd">
<td><a href=/subdomain/edit.php?edit_domain_id=841652>teamspeak.bz</a>
<br><span> (308 hosts in use) <a target=_blank rel="nofollow" href=http://www.teamspeak.bz/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=373066&subject=teamspeak.bz>riki123123</a>
</td>
<td>1061 days ago (11/05/2012)</td>
</tr>
<tr class="trl">
<td><a href=/subdomain/edit.php?edit_domain_id=238770>hs.vc</a>
<br><span> (307 hosts in use) <a target=_blank rel="nofollow" href=http://www.hs.vc/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=382500&subject=hs.vc>xpuctoc</a>
</td>
<td>2730 days ago (04/12/2008)</td>
</tr>
<tr class="trd">
<td><a href=/subdomain/edit.php?edit_domain_id=728119>oneindonesia.co.id</a>
<br><span> (307 hosts in use) <a target=_blank rel="nofollow" href=http://www.oneindonesia.co.id/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=822755&subject=oneindonesia.co.id>basarah</a>
</td>
<td>1448 days ago (10/11/2011)</td>
Here is my script:
#!/bin/python
from bs4 import BeautifulSoup
import requests
response = requests.get('http://freedns.afraid.org/domain/registry/page-1.html')
soup = BeautifulSoup(response.text, 'html.parser')
pricing = soup.find(id = 'pricing')
first_column = pricing.find('centre', {'border': '0'})
for li in first_column.find('tr', {'class': 'trl'}):
    if 'public' in str(li).lower():
        public = li.find('a').text
        print(public)
It returns this error:
Traceback (most recent call last):
  File "afraid.org-scrape", line 9, in <module>
    first_column = pricing.find('centre', {'border': '0'})
AttributeError: 'NoneType' object has no attribute 'find'
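How can I extract the list of public domains, along with the number of hosts in use for each, and print it to STDOUT or a text file?
I'd like the output to be clean and concise, with no extra clutter.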
Answer 0 (score: 0)
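These two lines are the problem:
    pricing = soup.find(id = 'pricing')
    first_column = pricing.find('centre', {'border': '0'})
The page you fetched has no element with id 'pricing', so soup.find returns None, and calling .find on None raises the AttributeError. In addition, 'centre' is not a tag name (the tag in the excerpt is <center>), and border=0 is an attribute of the <table>, not of <center>. Use something like this instead:
    first_column = soup.find('table', {'border': '0'})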
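Finding the table is only the first step. To get the public domains together with their host counts, a minimal sketch might look like the following. It assumes the live page keeps the row layout shown in the question's excerpt (data rows with class "trl" or "trd", the status in the second cell, and the count in the "(N hosts in use)" text). The names SAMPLE_HTML, HOSTS_RE and public_domains are made up for this illustration, and the sample is a trimmed copy of the excerpt so the code runs without network access.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question (assumption: the live page
# uses the same row layout).
SAMPLE_HTML = """
<table border=0 width=90%>
<tr class="trl">
<td><a href=/subdomain/edit.php?edit_domain_id=657177>arno.fi</a>
<br><span> (315 hosts in use) <a target=_blank rel="nofollow" href=http://www.arno.fi/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=661291&subject=arno.fi>kajarno</a>
</td>
<td>1656 days ago (03/19/2011)</td>
</tr>
<tr class="trd">
<td><a href=/subdomain/edit.php?edit_domain_id=564813>remoteaccess.me</a>
<br><span> (313 hosts in use) <a target=_blank rel="nofollow" href=http://www.remoteaccess.me/>website</a></span>
</td>
<td>private</td>
<td><a href=/tools/contact.php?user_id=470899&subject=remoteaccess.me>theosophia</a>
</td>
<td>1790 days ago (11/07/2010)</td>
</tr>
<tr class="trd">
<td><a href=/subdomain/edit.php?edit_domain_id=841652>teamspeak.bz</a>
<br><span> (308 hosts in use) <a target=_blank rel="nofollow" href=http://www.teamspeak.bz/>website</a></span>
</td>
<td>public</td>
<td><a href=/tools/contact.php?user_id=373066&subject=teamspeak.bz>riki123123</a>
</td>
<td>1061 days ago (11/05/2012)</td>
</tr>
</table>
"""

# Matches the "(315 hosts in use)" text that follows each domain link.
HOSTS_RE = re.compile(r"\((\d+) hosts? in use\)")


def public_domains(html):
    """Return (domain, hosts_in_use) tuples for rows whose status is 'public'."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Data rows alternate between the classes "trl" and "trd"; header rows
    # have no class and are skipped.
    for row in soup.find_all("tr", class_=["trl", "trd"]):
        cells = row.find_all("td", recursive=False)
        # The second cell holds the status ("public" or "private").
        if len(cells) < 2 or cells[1].get_text(strip=True).lower() != "public":
            continue
        link = cells[0].find("a")  # first link in the cell is the domain
        match = HOSTS_RE.search(cells[0].get_text())
        if link is None or match is None:
            continue  # row does not look like the excerpt; skip it
        results.append((link.get_text(strip=True), int(match.group(1))))
    return results


if __name__ == "__main__":
    # One "domain<TAB>hosts" line per public domain.
    for domain, hosts in public_domains(SAMPLE_HTML):
        print(f"{domain}\t{hosts}")
```

To run it against the real site, pass requests.get('http://freedns.afraid.org/domain/registry/page-1.html').text to public_domains instead of SAMPLE_HTML. Because the script only prints to STDOUT, redirecting it in the shell (for example python scrape.py > domains.txt, where both file names are placeholders) gives you the text file.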