I'm crawling reddit to collect the link of every entry in a subreddit. I also want to follow the links that match http://imgur.com/gallery/\w*. But I'm having trouble running the callback for Imgur: it simply never executes. What is failing?
I detect the Imgur URLs with a simple if "http://imgur.com/gallery/" in item['link'][0]: statement; maybe Scrapy provides a better way to detect them?
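For reference, a plain regular expression would be one alternative to the substring test (just a sketch; the is_imgur_gallery helper name is only illustrative):

import re

# Mirrors the http://imgur.com/gallery/\w* pattern from above.
GALLERY_RE = re.compile(r'^https?://imgur\.com/gallery/\w+')

def is_imgur_gallery(url):
    # True for links such as http://imgur.com/gallery/fqqBq
    return bool(GALLERY_RE.match(url))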
This is what I have tried:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from reddit.items import RedditItem


class RedditSpider(CrawlSpider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "http://www.reddit.com/r/pics",
    ]

    rules = [
        Rule(
            # Follow reddit's "next page" pagination links.
            LinkExtractor(allow=[r'/r/pics/\?count=\d.*&after=\w.*']),
            callback='parse_item',
            follow=True
        )
    ]

    def parse_item(self, response):
        for title in response.xpath("//div[contains(@class, 'entry')]/p/a"):
            item = RedditItem()
            item['title'] = title.xpath('text()').extract()
            item['link'] = title.xpath('@href').extract()
            yield item

            # Follow Imgur gallery links with a dedicated callback.
            if "http://imgur.com/gallery/" in item['link'][0]:
                # print item['link'][0]
                url = response.urljoin(item['link'][0])
                print url
                yield scrapy.Request(url, callback=self.parse_imgur_gallery)

    def parse_imgur_gallery(self, response):
        print response.url
And this is my Item class:
import scrapy


class RedditItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
This is the output when executing the spider with --nolog and printing the url variable inside the if condition (it is not the output of the response.url variable). The callback still doesn't run:
PS C:\repos\python\scrapy\reddit> scrapy crawl --output=export.json --nolog reddit
http://imgur.com/gallery/W7sXs/new
http://imgur.com/gallery/v26KnSX
http://imgur.com/gallery/fqqBq
http://imgur.com/gallery/9GDTP/new
http://imgur.com/gallery/5gjLCPV
http://imgur.com/gallery/l6Tpavl
http://imgur.com/gallery/Ow4gQ
...
Answer 0 (score: 1):
Found it. The imgur.com domain is not in allowed_domains, so Scrapy's offsite middleware silently filters out the imgur requests before the callback can ever run. Just add it:

allowed_domains = ["reddit.com", "imgur.com"]
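For completeness, a minimal sketch of the fixed spider header; only allowed_domains changes, the rest of the spider stays exactly as in the question:

class RedditSpider(CrawlSpider):
    name = "reddit"
    # Allow imgur.com as well, so the offsite middleware lets the
    # gallery requests through to parse_imgur_gallery.
    allowed_domains = ["reddit.com", "imgur.com"]

Running without --nolog also makes this kind of problem visible: the offsite middleware logs a DEBUG message like "Filtered offsite request to 'imgur.com'" for every request it drops.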