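My spider for scraping syllabi from the MIT OpenCourseWare site isn't working. Can someone help me figure out what's wrong with it? The `.*` is meant to match every course. Is that right?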
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from opensyllabi.items import OpensyllabiItem

    class MITSpider(CrawlSpider):
        name = 'mit'
        allowed_domains = ['ocw.mit.edu']
        start_urls = ['http://ocw.mit.edu/courses']
        rules = [Rule(SgmlLinkExtractor(allow=['/.*/.*/syllabus']), 'parse_syllabus')]

        def parse_syllabus(self, response):
            x = HtmlXPathSelector(response)

            syllabus = OpensyllabiItem()
            syllabus['url'] = response.url
            syllabus['body'] = x.select("//div[@id='course_inner_section']").extract()
            return syllabus
Answer 0 (score: 1)
Try:
    rules = [
        Rule(SgmlLinkExtractor(allow=r'/[^/]+/[^/]+/syllabus'), 'parse_syllabus'),
        Rule(SgmlLinkExtractor()),
    ]
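With only the syllabus rule, the spider looks for syllabus links on the start page itself and follows nothing else. The second rule makes it follow every link it finds, starting with all the links on the first page, so it can reach the course pages. Be warned: that is a lot of links.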
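To see how the two `allow` patterns differ, here is a minimal sketch using only Python's `re` module, no Scrapy required. It assumes the pattern is applied as an unanchored search against the absolute URL, which is my understanding of how the link extractor treats `allow`. The two URLs are made up for illustration and are not taken from the real site.

```python
import re

# Pattern from the question and pattern from the answer.
LOOSE = re.compile(r'/.*/.*/syllabus')
STRICT = re.compile(r'/[^/]+/[^/]+/syllabus')

# Hypothetical URLs, invented for this example.
urls = [
    'http://ocw.mit.edu/courses/some-department/some-course/syllabus',
    'http://ocw.mit.edu/syllabus',
]

def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None

# Map each URL to a (loose, strict) pair of booleans.
results = {url: (matches(LOOSE, url), matches(STRICT, url)) for url in urls}

for url, (loose, strict) in results.items():
    print(url, 'loose:', loose, 'strict:', strict)
```

Both patterns accept the deep course URL. The loose one also accepts the second URL, because `.*` can match an empty string and can span slashes, so the `//` after `http:` already satisfies it. `[^/]+` requires at least one non-slash character per path segment, which is why the answer's pattern is the tighter of the two.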