Python 3.4 - Looping through n URLs, where n is not fixed

Asked: 2016-02-23 03:08:45

Tags: python loops url web-scraping

What is the easiest way to loop through a series of URLs until no more results are returned?

If the number of URLs were fixed, say 9, then something like the code below would work:

for i in range(1,10):
    print('http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page='+ str(i)+'&sort_order=default ')
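As an aside, when a query string has this many parameters, building it with `urllib.parse.urlencode` from the standard library is easier to read and maintain than string concatenation. A minimal sketch, assuming the parameter names and values taken from the URL above:

```python
from urllib.parse import urlencode

BASE = 'http://www.trademe.co.nz/browse/categorylistings.aspx'

def page_url(page):
    # Assemble the query string from a dict; urlencode handles the escaping
    # (e.g. the '/' characters in mcatpath become %2F automatically).
    params = {
        'v': 'list',
        'rptpath': '4-380-50-7145-',
        'mcatpath': 'sports/cycling/mountain-bikes/full-suspension',
        'page': page,
        'sort_order': 'default',
    }
    return BASE + '?' + urlencode(params)

for i in range(1, 10):
    print(page_url(i))
```

This keeps the page number as the only moving part, which also makes the dynamic-length loop in the answer below easier to write.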

However, the number of URLs is dynamic, and when I overshoot I get a page that says "Sorry, there are currently no listings in this category." An example is below.

http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=10&sort_order=default

What is the easiest way to fetch only the pages that contain results?

Cheers, Steve

1 answer:

Answer 0 (score: 1)

# count is an iterator that just keeps going
# from itertools import count
# but I'm not going to use it, because you want to set a reasonable limit
# otherwise you'll loop endlessly if your end condition fails

# requests is third party but generally better than the standard libs
import requests

base_url = 'http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page={}&sort_order=default'

for i in range(1, 30):
    result = requests.get(base_url.format(i))
    if result.status_code != 200:
        break
    content = result.text  # requests decodes the body for you
    # Note, this is actually quite fragile
    # For example, they have 2 spaces between 'no' and 'listings'
    # so looking for 'no listings' would break
    # for a more robust solution be more clever.
    if 'Sorry, there are currently no' in content:
        break

    # do stuff with your content here
    print(i)
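As the comments above note, matching on the "Sorry..." text is fragile. One way to make the loop easier to harden and to test is to separate the paging logic from the fetching and the stop condition. A minimal sketch (`scrape_pages`, `fake_fetch`, and `looks_empty` are hypothetical names, not part of the original answer); the fake fetcher stands in for `requests.get` so the loop can be exercised without network access:

```python
def scrape_pages(fetch_page, looks_empty, max_pages=30):
    """Yield (page_number, content) until a page looks empty or max_pages is hit."""
    for i in range(1, max_pages + 1):
        content = fetch_page(i)
        if content is None or looks_empty(content):
            break  # no more listings; stop paging
        yield i, content

# Fake fetcher: pages 1-3 have listings, page 4 reports none.
def fake_fetch(page):
    if page <= 3:
        return '<div class="listing">bike {}</div>'.format(page)
    return 'Sorry, there are currently no listings in this category.'

pages = list(scrape_pages(fake_fetch, lambda c: 'currently no' in c))
print([i for i, _ in pages])  # → [1, 2, 3]
```

With this shape, swapping the naive substring test for something sturdier (e.g. parsing the HTML and counting listing elements) only touches the `looks_empty` argument, not the loop itself.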