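I am trying to crawl different pages of the Kickstarter website. The problem is that the project pages do not sit under the start URL. For example:
The start URL (https://www.kickstarter.com/discover/advanced) contains a list of project links. I want to visit each listed link, e.g. https://www.kickstarter.com/projects/pirl/the-ultimate-charger-for-power-users?ref=discovery
So in the link extractor I put /projects, but it does not extract anything. (I think it tries to go through "https://www.kickstarter.com/discover/advanced/projects".) When I used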
    rules = [
        Rule(LinkExtractor(allow=(''),),
             callback='parse_item',
             follow=True)
    ]
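it lists the hrefs extracted from that page, which is great. However, it does not take me to the main page of each extracted link. How can I do that?
Here is my main code: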
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'kickstarter'
        allowed_domains = ['kickstarter.com']
        start_urls = ['http://www.kickstarter.com/discover/advanced']
        rules = [
            Rule(LinkExtractor(allow=(''),),
                 callback='parse_item',
                 follow=True)
        ]

        def parse_item(self, response):
            print(response.url)
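
For reference, here is a small sketch of how I understand the `allow` argument: as far as I can tell, Scrapy treats it as a regular expression searched against each absolute URL found on the page, not as a path appended to the start URL. The snippet below only uses the standard `re` module to imitate that filtering; the pattern `/projects/` and the second URL in the list are assumptions made up for illustration, not something taken from the real page.

```python
import re

# Assumed pattern: match any absolute URL that contains "/projects/".
# This imitates how an `allow` regex would be searched against a URL.
allow = re.compile(r'/projects/')

# Example absolute URLs; the second one is a made-up non-project link.
urls = [
    'https://www.kickstarter.com/projects/pirl/the-ultimate-charger-for-power-users?ref=discovery',
    'https://www.kickstarter.com/discover/advanced?page=2',
]

# Keep only the URLs where the pattern is found anywhere in the string.
matched = [u for u in urls if allow.search(u)]

for u in matched:
    print(u)
```

If this understanding is right, passing the same pattern as `LinkExtractor(allow=r'/projects/')` in the rule should restrict the crawl to project pages, but I have not confirmed that against the live site.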