Extracting specific information from a table on a web page

Time: 2015-10-04 09:55:12

Tags: python web web-crawler bots

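I'm writing a simple Python web scraper. I want to extract some specific information from a website: freedns.afraid.org has a table of public and private domains, and I want to scrape only the public ones.

To give you a better idea, here is an excerpt from the page: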

<center>
  <table border=0 width=90%>
    <form action=/domain/registry/>
    <input type=hidden name=sort value=5>
    <tr>
      <td bgcolor=cornflowerblue align=right colspan=4>
        <table width=100% cellpadding=0 cellspacing=0>
          <tr>
            <td align=center><font face="verdana, Helvetica, Arial" size="2" color="white">Showing <b>201</b>-<b>300</b> of <b>89,317</b> total</font>
            </td>
            <td align=right>
              <input type=text name=q value="">
              <input style="background:orange;color:white;" type=submit name=submit value=SEARCH>
            </td>
          </tr>
        </table>
      </td>
    </tr>
    </form>
    <tr>
      <td bgcolor=cornflowerblue><a href="?sort=1&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Domain</font></a>
      </td>
      <td bgcolor=cornflowerblue><a href="?sort=2&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Status</font></a>
      </td>
      <td bgcolor=cornflowerblue><a href="?sort=3&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Owner</font></a>
      </td>
      <td bgcolor=cornflowerblue><a href="?sort=4&q="><font face="verdana, Helvetica, Arial" size="2" color="white">Age</font></a>
      </td>
    </tr>
    <tr>
      <td colspan=4 bgcolor=#cccccc><font face="verdana, Helvetica, Arial" size="2" color="black">Sorted by: <b><u>Popularity</u></b></font>
      </td>
    </tr>
    <tr class="trl">
      <td><a href=/subdomain/edit.php?edit_domain_id=657177>arno.fi</a>
        <br><span> (315 hosts in use) <a target=_blank rel="nofollow" href=http://www.arno.fi/>website</a></span>
      </td>
      <td>public</td>
      <td><a href=/tools/contact.php?user_id=661291&subject=arno.fi>kajarno</a>
      </td>
      <td>1656 days ago (03/19/2011)</td>
    </tr>
    <tr class="trd">
      <td><a href=/subdomain/edit.php?edit_domain_id=697217>orenznakomstva.ru</a>
        <br><span> (314 hosts in use) <a target=_blank rel="nofollow" href=http://www.orenznakomstva.ru/>website</a></span>
      </td>
      <td>public</td>
      <td><a href=/tools/contact.php?user_id=642020&subject=orenznakomstva.ru>igor67</a>
      </td>
      <td>1500 days ago (07/25/2011)</td>
    </tr>
    <tr class="trl">
      <td><a href=/subdomain/edit.php?edit_domain_id=82359>4040.idv.tw</a>
        <br><span> (313 hosts in use) <a target=_blank rel="nofollow" href=http://www.4040.idv.tw/>website</a></span>
      </td>
      <td>public</td>
      <td><a href=/tools/contact.php?user_id=163078&subject=4040.idv.tw>paulliu</a>
      </td>
      <td>3605 days ago (10/31/2005)</td>
    </tr>
    <tr class="trd">
      <td><a href=/subdomain/edit.php?edit_domain_id=564813>remoteaccess.me</a>
        <br><span> (313 hosts in use) <a target=_blank rel="nofollow" href=http://www.remoteaccess.me/>website</a></span>
      </td>
      <td>private</td>
      <td><a href=/tools/contact.php?user_id=470899&subject=remoteaccess.me>theosophia</a>
      </td>
      <td>1790 days ago (11/07/2010)</td>
    </tr>
    <tr class="trl">
      <td><a href=/subdomain/edit.php?edit_domain_id=710791>stes.fi</a>
        <br><span> (311 hosts in use) <a target=_blank rel="nofollow" href=http://www.stes.fi/>website</a></span>
      </td>
      <td>private</td>
      <td><a href=/tools/contact.php?user_id=794524&subject=stes.fi>jkortela</a>
      </td>
      <td>1496 days ago (08/25/2011)</td>
    </tr>
    <tr class="trd">
      <td><a href=/subdomain/edit.php?edit_domain_id=841652>teamspeak.bz</a>
        <br><span> (308 hosts in use) <a target=_blank rel="nofollow" href=http://www.teamspeak.bz/>website</a></span>
      </td>
      <td>public</td>
      <td><a href=/tools/contact.php?user_id=373066&subject=teamspeak.bz>riki123123</a>
      </td>
      <td>1061 days ago (11/05/2012)</td>
    </tr>
    <tr class="trl">
      <td><a href=/subdomain/edit.php?edit_domain_id=238770>hs.vc</a>
        <br><span> (307 hosts in use) <a target=_blank rel="nofollow" href=http://www.hs.vc/>website</a></span>
      </td>
      <td>public</td>
      <td><a href=/tools/contact.php?user_id=382500&subject=hs.vc>xpuctoc</a>
      </td>
      <td>2730 days ago (04/12/2008)</td>
    </tr>
    <tr class="trd">
      <td><a href=/subdomain/edit.php?edit_domain_id=728119>oneindonesia.co.id</a>
        <br><span> (307 hosts in use) <a target=_blank rel="nofollow" href=http://www.oneindonesia.co.id/>website</a></span>
      </td>
      <td>public</td>
      <td><a href=/tools/contact.php?user_id=822755&subject=oneindonesia.co.id>basarah</a>
      </td>
      <td>1448 days ago (10/11/2011)</td>
  

Here is my script:

#!/bin/python
from bs4 import BeautifulSoup
import requests
response = requests.get('http://freedns.afraid.org/domain/registry/page-1.html')
soup = BeautifulSoup(response.text, 'html.parser')
pricing = soup.find(id = 'pricing')
first_column = pricing.find('centre', {'border': '0'})
for li in first_column.find('tr', {'class': 'trl'}):
    if 'public' in str(li).lower():
        public = li.find('a').text
print(public)  

It returns this error:

Traceback (most recent call last):
File "afraid.org-scrape", line 9, in <module>
first_column = pricing.find('centre', {'border': '0'})
AttributeError: 'NoneType' object has no attribute 'find'

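How can I extract a list of the public domains along with the number of hosts in use, and print it to STDOUT or a text file?
I would like the output to be clean and concise, with no extra clutter.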

1 Answer:

Answer 0 (score: 0)

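The traceback shows that soup.find(id = 'pricing') returned None, which means the parsed page has no element with that id, so the next call fails. Instead of

pricing = soup.find(id = 'pricing')
first_column = pricing.find('centre', {'border': '0'})

use something like this:

first_column = soup.find('table', {'border': '0'})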
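
To also get the host counts the question asks for, a fuller version might look like the sketch below. It is only an illustration under some assumptions: the data rows carry class "trl" or "trd" as in the excerpt, the second cell holds the status, and the host count appears as "(N hosts in use)" in the first cell. The names SAMPLE_HTML, HOSTS_RE and extract_public_domains are made up for this example, and the sample markup is a trimmed copy of the excerpt with attributes quoted so it runs offline.

```python
import re
from bs4 import BeautifulSoup

# Trimmed sample modelled on the excerpt in the question (hypothetical test data).
SAMPLE_HTML = """
<table border="0" width="90%">
  <tr class="trl">
    <td><a href="/subdomain/edit.php?edit_domain_id=657177">arno.fi</a>
      <br><span> (315 hosts in use) <a href="http://www.arno.fi/">website</a></span>
    </td>
    <td>public</td>
    <td><a href="/tools/contact.php?user_id=661291">kajarno</a></td>
    <td>1656 days ago (03/19/2011)</td>
  </tr>
  <tr class="trd">
    <td><a href="/subdomain/edit.php?edit_domain_id=564813">remoteaccess.me</a>
      <br><span> (313 hosts in use) <a href="http://www.remoteaccess.me/">website</a></span>
    </td>
    <td>private</td>
    <td><a href="/tools/contact.php?user_id=470899">theosophia</a></td>
    <td>1790 days ago (11/07/2010)</td>
  </tr>
  <tr class="trd">
    <td><a href="/subdomain/edit.php?edit_domain_id=841652">teamspeak.bz</a>
      <br><span> (308 hosts in use) <a href="http://www.teamspeak.bz/">website</a></span>
    </td>
    <td>public</td>
    <td><a href="/tools/contact.php?user_id=373066">riki123123</a></td>
    <td>1061 days ago (11/05/2012)</td>
  </tr>
</table>
"""

# Matches the "(315 hosts in use)" fragment and captures the number.
HOSTS_RE = re.compile(r"\((\d+) hosts? in use\)")


def extract_public_domains(html):
    """Return a list of (domain, hosts_in_use) tuples for rows marked public."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Data rows alternate between the two classes seen in the excerpt.
    for row in soup.find_all("tr", class_=["trl", "trd"]):
        cells = row.find_all("td", recursive=False)
        if len(cells) < 2:
            continue
        # Compare the status cell exactly, rather than searching the whole row,
        # so a domain or owner name containing "public" cannot match by accident.
        if cells[1].get_text(strip=True).lower() != "public":
            continue
        link = cells[0].find("a")
        if link is None:
            continue
        match = HOSTS_RE.search(cells[0].get_text())
        hosts = int(match.group(1)) if match else None
        results.append((link.get_text(strip=True), hosts))
    return results


if __name__ == "__main__":
    for domain, hosts in extract_public_domains(SAMPLE_HTML):
        print(f"{domain}\t{hosts}")
```

To run this against the live page, pass requests.get(...).text from the question's script to extract_public_domains instead of SAMPLE_HTML. Redirecting the output to a file (or opening one and writing the same lines) gives the text-file variant.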