Simple Python web crawler

Asked: 2016-09-19 05:06:48

Tags: python web-crawler

I'm following a Python tutorial on YouTube and got to the part where we build a basic web crawler. I tried giving myself a very simple task: go to the cars section of my city's craigslist, print the title/link of each entry, then jump to the next page and repeat as needed. It works for the first page, but it never moves on to the next page to grab more data. Can someone help explain what's wrong?

import requests
from bs4 import BeautifulSoup

def widow(max_pages):
    page = 0 # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page) # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml') # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class':'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href') # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
            page += 100 # craigslist pages go 0, 100, 200, etc

widow(0) # 0 gets the first page, replace with multiples of 100 for extra pages

1 Answer:

Answer 0 (score: 2)

It looks like you have an indentation problem: the line below needs to move out of the for loop and sit directly inside the main while block, so it runs once per page instead of once per link.

page += 100
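To see why the placement matters, here is the pagination logic in isolation: a minimal sketch using a hypothetical `page_urls` helper (not from the original post) that builds the list of result-page URLs, with the increment at the while-loop level.

```python
def page_urls(max_pages):
    # Build the list of craigslist result-page URLs.
    # page += 100 sits in the while loop, NOT inside a for loop over
    # links, so it advances exactly once per results page.
    urls = []
    page = 0  # craigslist starts at page 0
    while page <= max_pages:
        urls.append('http://orlando.craigslist.org/search/cto?s=' + str(page))
        page += 100  # craigslist pages go 0, 100, 200, etc.
    return urls

print(page_urls(200))  # three pages: s=0, s=100, s=200
```

In the original code, `page += 100` runs once per matched link, so after the first page of ~100 results the counter has already blown past `max_pages` and the while loop exits.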