How do I iterate over websites with Scrapy? I want to extract the body of every page matching http://www.saylor.org/site/syllabus.php?cid=NUMBER, where NUMBER runs from 1 to about 400.
I've written this spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from syllabi.items import SyllabiItem


class SyllabiSpider(CrawlSpider):
    name = 'saylor'
    allowed_domains = ['saylor.org']
    start_urls = ['http://www.saylor.org/site/syllabus.php?cid=']
    rules = [Rule(SgmlLinkExtractor(allow=['\d+']), 'parse_syllabi')]

    def parse_syllabi(self, response):
        x = HtmlXPathSelector(response)
        syllabi = SyllabiItem()
        syllabi['url'] = response.url
        syllabi['body'] = x.select("/html/body/text()").extract()
        return syllabi
But it doesn't work. I know it's looking for links on the start_url, which isn't what I want. I want to iterate over those pages directly. Does that make sense?
Thanks for the help.
Answer 0 (score: 12)
Try this:
from scrapy.spider import BaseSpider
from scrapy.http import Request

from syllabi.items import SyllabiItem


class SyllabiSpider(BaseSpider):
    name = 'saylor'
    allowed_domains = ['saylor.org']
    max_cid = 400

    def start_requests(self):
        # Generate one request per cid value instead of following links.
        for i in range(self.max_cid):
            yield Request('http://www.saylor.org/site/syllabus.php?cid=%d' % i,
                          callback=self.parse_syllabi)

    def parse_syllabi(self, response):
        syllabi = SyllabiItem()
        syllabi['url'] = response.url
        syllabi['body'] = response.body
        return syllabi