Scrapy: following imgur links scraped from subreddits

Date: 2015-10-09 22:36:33

Tags: python scrapy screen-scraping scrapy-spider

I'm scraping reddit to get the link of every entry in a subreddit. I also want to follow the links that match http://imgur.com/gallery/\w*, but I'm having trouble getting the callback for Imgur to run. It simply never executes. What is failing?

I detect the Imgur URLs with a simple if "http://imgur.com/gallery/" in item['link'][0]: statement. Maybe Scrapy provides a better way to detect them?
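As an aside, a common alternative to the substring check is a precompiled regular expression, which anchors the match at the start of the URL and can also cover https. A minimal sketch (is_imgur_gallery is an illustrative helper, not part of Scrapy's API):

```python
import re

# Matches the http://imgur.com/gallery/\w* form from the question,
# plus the https variant.
IMGUR_GALLERY = re.compile(r"https?://imgur\.com/gallery/\w+")

def is_imgur_gallery(url):
    """Return True if the URL points at an imgur gallery."""
    return IMGUR_GALLERY.match(url) is not None

print(is_imgur_gallery("http://imgur.com/gallery/fqqBq"))   # True
print(is_imgur_gallery("http://www.reddit.com/r/pics"))     # False
```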

This is what I have tried:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from reddit.items import RedditItem


class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    rules = [
        Rule(
            LinkExtractor(allow=['/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()

            yield item

            if "http://imgur.com/gallery/" in item['link'][0]:
                # print item['link'][0]
                url = response.urljoin(item['link'][0])
                print url
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print response.url

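One detail worth noting about the response.urljoin call in parse_item: for links that are already absolute, urljoin returns them unchanged, so the imgur URLs pass through as-is. This can be verified with the standard library directly (Python 3's urllib.parse here; the question's code is Python 2, where the equivalent lives in the urlparse module):

```python
from urllib.parse import urljoin

base = "http://www.reddit.com/r/pics"

# An absolute link is returned unchanged...
print(urljoin(base, "http://imgur.com/gallery/fqqBq"))
# http://imgur.com/gallery/fqqBq

# ...while a relative link is resolved against the base.
print(urljoin(base, "/r/pics?count=25&after=t3_abc"))
# http://www.reddit.com/r/pics?count=25&after=t3_abc
```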
This is my Item class:

import scrapy


class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

This is the output when running the spider with --nolog and printing the url variable inside the if condition (it is the url variable being printed, not response.url). The callback still doesn't run:

PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...

1 answer:

Answer 0 (score: 1)

Found it. The imgur.com domain wasn't allowed in allowed_domains. Just add it...

allowed_domains = ["reddit.com", "imgur.com"]
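To see why the callback never fired: Scrapy's OffsiteMiddleware drops any request whose host is not covered by allowed_domains. With logging enabled you would see "Filtered offsite request" DEBUG messages, but --nolog hides them. A rough standard-library sketch of that check (is_allowed is an illustrative helper, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """Keep a request only if its host matches an allowed domain
    or is a subdomain of one, mimicking OffsiteMiddleware."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# imgur requests are silently dropped until imgur.com is added:
print(is_allowed("http://imgur.com/gallery/fqqBq", ["reddit.com"]))               # False
print(is_allowed("http://imgur.com/gallery/fqqBq", ["reddit.com", "imgur.com"]))  # True
```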