Unable to crawl pages with a simple Scrapy spider

Date: 2016-01-01 08:58:39

Tags: python web-scraping scrapy

I'm very new to Scrapy, and I'm trying to crawl a website with a simple spider (built on another spider found here: http://scraping.pro/web-scraping-python-scrapy-blog-series/).

Why does my spider crawl 0 pages (with no errors)?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class TutsPlus(CrawlSpider):
    name = "tutsplus"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = [
    "http://code.tutsplus.com/posts?page="
    ]

    rules = [Rule(LinkExtractor(allow=['/posts?page=\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['title'] = response.xpath("//li[@class='posts__post']/a/text()").extract()
        return story

A very similar spider works fine:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import NewsItem

class BbcSpider(CrawlSpider):
    name = "bbcnews"
    allowed_domains = ["bbc.co.uk"]
    start_urls = [
    "http://www.bbc.co.uk/news/technology/",
    ]

    rules = [Rule(LinkExtractor(allow=['/technology-\d+']), 'parse_story')]

    def parse_story(self, response):
        story = NewsItem()
        story['url'] = response.url
        story['headline'] = response.xpath("//title/text()").extract()
        story['intro'] = response.css('.story-body__introduction::text').extract()
        return story

1 Answer:

Answer 0 (score: 0)

Your regex '/posts?page=\d+' doesn't look like what you really want: the unescaped '?' is a regex quantifier that makes the preceding 's' optional, so the pattern matches URLs like '/postspage=2' and '/postpage=2' rather than the actual pagination URLs.
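
A quick way to confirm this is to test the pattern against a few sample paths with Python's re module (a minimal sketch, not from the original post):

import re

# The unescaped '?' makes the preceding 's' optional instead of
# matching a literal question mark.
broken = r'/posts?page=\d+'

print(re.search(broken, '/postspage=2'))   # matches
print(re.search(broken, '/postpage=2'))    # matches
print(re.search(broken, '/posts?page=2'))  # None -- the real pagination URLs are missed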

I think you need something like '/posts\?page=\d+', which escapes the '?'.
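
For completeness, a sketch of the escaped pattern in action (assuming the rest of the spider stays as posted; the check below is only illustrative):

import re

fixed = r'/posts\?page=\d+'
print(re.search(fixed, '/posts?page=2'))  # matches the real pagination URLs
print(re.search(fixed, '/postspage=2'))   # None

# The spider's rule would then become:
# rules = [Rule(LinkExtractor(allow=[r'/posts\?page=\d+']), 'parse_story')]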