Scrapy: crawling through a crowdfunding website

Posted: 2018-06-18 19:31:17

Tags: web-scraping scrapy scrapy-spider

I am trying to crawl different pages of the Kickstarter website. The problem is that the pages I want to reach have URLs different from the start URL. For example:

The start_url (https://www.kickstarter.com/discover/advanced) contains a list of project links. I want to go to each listed link, e.g. https://www.kickstarter.com/projects/pirl/the-ultimate-charger-for-power-users?ref=discovery

So in the LinkExtractor I put /projects as the allow pattern, but it doesn't extract anything. (I think it is trying to crawl "https://www.kickstarter.com/discover/advanced/projects".) When I used

rules = [
    Rule(LinkExtractor(allow=('')),
         callback='parse_item',
         follow=True)
]

it lists the hrefs extracted from that page, which is great. However, it doesn't take me to the pages those extracted links point to. How can I do that?

Here is my main code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com/discover/advanced']

    rules = [
        Rule(LinkExtractor(allow=('')),
             callback='parse_item',
             follow=True)
    ]

    def parse_item(self, response):
        print(response.url)
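
For reference, this is a minimal sketch of the kind of rule I expected to work: the same spider, but with the extractor restricted to URLs containing /projects/. The regex, the spider name, and the assumption that every project page matches that pattern are my own guesses, not something confirmed to work:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProjectSpider(CrawlSpider):
    name = 'kickstarter_projects'
    allowed_domains = ['kickstarter.com']
    start_urls = ['https://www.kickstarter.com/discover/advanced']

    # Only follow links whose URL contains /projects/ and pass each
    # downloaded project page to parse_item.
    rules = [
        Rule(LinkExtractor(allow=(r'/projects/',)),
             callback='parse_item',
             follow=True)
    ]

    def parse_item(self, response):
        # For now, just confirm which project page was reached.
        print(response.url)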
