Scrapy crawler: how to specify which links to crawl

Time: 2013-10-30 08:06:37

Tags: python web-scraping beautifulsoup scrapy

I am using Scrapy to crawl my website, http://www.cseblog.com

My spider is as follows:

from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup ## This is BeautifulSoup4
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from blogscraper.items import BlogArticle ## This is for saving data. Probably insignificant.

class BlogArticleSpider(BaseSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = [
        "http://www.cseblog.com/",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+/\d+/*"', ), deny=( ))),
    )

    def parse(self, response):
        site = BeautifulSoup(response.body_as_unicode())
        items = []
        item = BlogArticle()
        item['title'] = site.find("h3" , {"class": "post-title" } ).text.strip()
        item['link'] = site.find("h3" , {"class": "post-title" } ).a.attrs['href']
        item['text'] = site.find("div" , {"class": "post-body" } )
        items.append(item)
        return items

Where do I specify that it should recursively crawl all links of the form http://www.cseblog.com/{d+}/{d+}/{*}.html and http://www.cseblog.com/search/{*}

but save data only from pages of the form http://www.cseblog.com/{d+}/{d+}/{*}.html?

1 answer:

Answer 0: (score: 1)

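You have to create two rules (or a single combined one) that tell Scrapy which types of URL to allow. Basically, your rules tuple should look something like this: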

rules = (
        # Article pages: save their data and keep following the links found on them.
        Rule(SgmlLinkExtractor(allow=(r'/\d+/\d+/.+\.html', )), callback='parse_save', follow=True),
        # Search pages: process without saving, and keep following links.
        Rule(SgmlLinkExtractor(allow=(r'/search/', )), callback='parse_only', follow=True),
)
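Note that the `{d+}` and `{*}` notation in the question is placeholder shorthand rather than regular-expression syntax, while `allow` takes regular expressions that are searched for inside each URL. Here is a minimal sketch, using only Python's `re` module, for checking candidate patterns before putting them into a rule. The two regexes are my guess at what the shorthand means, and the sample URLs are made up for illustration.

```python
import re

# Assumed regex equivalents of the question's shorthand (not from the original post).
ARTICLE_PATTERN = r"/\d+/\d+/.+\.html"
SEARCH_PATTERN = r"/search/"

# The shorthand taken literally: it compiles, but the braces are matched as plain characters.
PLACEHOLDER_PATTERN = "http://www.cseblog.com/{d+}/{d+}/{*}.html"


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL (a search, not a full match)."""
    return re.search(pattern, url) is not None


# Hypothetical URLs, used only to exercise the patterns.
sample_urls = [
    "http://www.cseblog.com/2013/10/some-post.html",
    "http://www.cseblog.com/search/label/puzzles",
    "http://www.cseblog.com/",
]

for url in sample_urls:
    print(
        url,
        "article:", matches(ARTICLE_PATTERN, url),
        "search:", matches(SEARCH_PATTERN, url),
        "shorthand:", matches(PLACEHOLDER_PATTERN, url),
    )
```

The shorthand pattern matches none of the sample URLs, which is why the rules need real regexes.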

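By the way, you should subclass CrawlSpider rather than BaseSpider, since rules are only applied by CrawlSpider. You should also rename your parse method unless you really intend to override that method from the base class: CrawlSpider uses parse itself to apply the rules, so overriding it stops the rules from working.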

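Each of the two link types gets its own callback, so in effect you decide per link type which processed pages have their data saved, instead of having a single callback that has to check response.url again.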
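For contrast, here is a rough sketch of the single-callback style that separate callbacks avoid. It is plain Python rather than Scrapy code: `handle_page` is a hypothetical stand-in for a callback, and the article regex is my assumption about the URL shape described in the question.

```python
import re

# Assumed regex for article URLs (not from the original post).
ARTICLE_RE = re.compile(r"/\d+/\d+/.+\.html")


def handle_page(url, title):
    """Single-callback style: every crawled page lands here, so the URL is checked again."""
    if ARTICLE_RE.search(url):
        # Article page: produce a record to save.
        return {"link": url, "title": title}
    # Any other page (for example a search page): nothing to save.
    return None
```

With one callback per rule, this URL check lives in the link extractor's `allow` pattern only, and each callback can assume it is looking at the kind of page it was written for.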