Stuck on this web scraper

Time: 2015-07-01 08:11:18

Tags: python-2.7 css-selectors web-scraping beautifulsoup

I am trying to build a program in Python 2.7 with BeautifulSoup that extracts all of the profile URLs from this page and the pages that follow it:

http://www.reaa.govt.nz/Pages/PublicRegisterSearch.aspx?pageNo=1&name=a*&orgName=&location=&licenceNo=&itemsPerPage=100&sortExpression=2

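I have been fighting with this program for a long time, but it still does not work. I think I am getting the CSS selector wrong, but I am not sure what else to try.

Please advise. I am new to both programming and Python.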

import requests
from bs4 import BeautifulSoup

def re_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.reaa.govt.nz/Pages/PublicRegisterSearch.aspx?pageNo=' + str(page) + '&name=a*&orgName=&location=&licenceNo=&itemsPerPage=100&sortExpression=2'
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.select('tr.alternate td a[id*=ct100_]'):
            href = link.get('href')
            print (href)
        page += 1

re_crawler(2)

1 answer:

Answer 0 (score: 1)

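Try this instead? The selector in your code looks for ct100_ (with the digit one); this version looks for ctl00_ (with a lowercase letter L). It also fetches the page with urlopen rather than requests.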

from urllib import urlopen
from bs4 import BeautifulSoup

def re_crawler(pages):
    page = 1
    while page <= pages:
        url = 'http://www.reaa.govt.nz/Pages/PublicRegisterSearch.aspx?pageNo=' + str(page) + '&name=a*&orgName=&location=&licenceNo=&itemsPerPage=100&sortExpression=2'
        code = urlopen(url)
        soup = BeautifulSoup(code)
        for link in soup.select('tr.alternate td a[id*=ctl00_]'):
            href = link.get('href')
            print (href)
        page += 1

re_crawler(2)
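
To see why the one-character difference in the selector matters, here is a minimal sketch of the substring test that a[id*=...] performs. It is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; it only mimics the id-substring part of the selector and ignores the tr.alternate td ancestry. The class AnchorIdFilter, the function hrefs_with_id_fragment and the sample markup are all invented for illustration, and the real page's markup may differ.

```python
from html.parser import HTMLParser


class AnchorIdFilter(HTMLParser):
    """Collect href values of <a> tags whose id contains a given substring."""

    def __init__(self, id_fragment):
        super().__init__()
        self.id_fragment = id_fragment
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Same idea as the CSS attribute selector a[id*=...]:
        # a plain substring match against the id attribute.
        if self.id_fragment in (attr_map.get("id") or ""):
            self.hrefs.append(attr_map.get("href"))


def hrefs_with_id_fragment(html, id_fragment):
    """Return hrefs of anchors whose id contains id_fragment (hypothetical helper)."""
    parser = AnchorIdFilter(id_fragment)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical markup, made up for this example; not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr class="alternate">
    <td><a id="ctl00_example_link_0" href="/profile/1">First</a></td>
  </tr>
  <tr class="alternate">
    <td><a id="ctl00_example_link_1" href="/profile/2">Second</a></td>
  </tr>
</table>
"""

# "ct100_" contains the digit one, so it never occurs in these ids.
print(hrefs_with_id_fragment(SAMPLE_HTML, "ct100_"))
# "ctl00_" contains a lowercase letter L and matches both anchors.
print(hrefs_with_id_fragment(SAMPLE_HTML, "ctl00_"))
```

If the ids on the real page do start with ctl00_, a selector that looks for ct100_ matches nothing, the loop body never runs, and the script prints nothing without raising an error.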