College is starting for me soon, so I decided to build a web scraper for Rate My Professors to help me find the highest-rated teachers at my school. The scraper works great... but only up to the second page! No matter what I try, I can't get it to go any further.
Here is the URL I'm trying to scrape: http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=3 (not my actual university, but the URL structure is the same)
Here is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem

class MySpider(CrawlSpider):
    name = "rmp"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]
    rules = (Rule(SgmlLinkExtractor(allow=(r'&pageNo=\d',),
                                    restrict_xpaths=('//a[@id="next"]',)),
                  callback='parser', follow=True),)

    def parser(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        profs = []
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            prof["dept"] = line.select("div[@class='profDept']/text()").extract()
            prof["ratings"] = line.select("div[@class='profRatings']/text()").extract()
            prof["avg"] = line.select("div[@class='profAvg']/text()").extract()
            profs.append(prof)
        return profs  # hand the collected items back to Scrapy
Some things I've tried include removing the restrict_xpaths keyword argument (which caused the scraper to follow the first, last, next, and back buttons, since they all share the &pageNo=\d URL structure) and changing the regular expression in the allow keyword argument (which changed nothing).
Does anyone have any suggestions? It seems like a simple problem, but I've already spent an hour and a half trying to solve it! Any help would be greatly appreciated.
Answer 0 (score: 3)
The site doesn't handle the page parameters well when they arrive out of the expected order. Look at the href values:
$ curl -q -s "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2" |grep \"next\"
<a href="/SelectTeacher.jsp?sid=2311&pageNo=3" id="next">c</a>
$ curl -q -s "http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311" |grep \"next\"
<a href="/SelectTeacher.jsp?pageNo=2&sid=2311&pageNo=3" id="next">c</a>
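That reordering comes from the default URL canonicalization that Scrapy's link extractors apply, which among other things sorts the query parameters alphabetically (pageNo before sid). A minimal sketch of the effect, assuming w3lib (the URL helper library that ships with Scrapy) is available:

from w3lib.url import canonicalize_url

# canonicalize_url() sorts the query string, which is exactly the
# reordering seen in the curl output above
print canonicalize_url('http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2')
# http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311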
To avoid modifying the original URL, you should pass the parameter canonicalize=False to the SgmlLinkExtractor class. Also, you might want to use a less specific XPath rule, because with the current rule you can't grab the items from the starting URL.
Like this:
rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="pagination"]',
                           canonicalize=False),
         callback='parser', follow=True),
]
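Before running the full crawl, you can sanity-check the extractor in scrapy shell; a sketch of such a session (the Link output below is illustrative, not captured from the live site):

$ scrapy shell "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths='//div[@id="pagination"]',
...                   canonicalize=False).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2', ...), ...]

Pointing restrict_xpaths at the whole pagination div means the first/back/next/last links are all extracted, but Scrapy's built-in duplicate filter keeps already-seen pages from being requested twice.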
Answer 1 (score: 0)
I posted on the Scrapy Google Groups page and received an answer! Here it is:
I think you may have found a bug. SgmlLinkExtractor has a problem with the page after the 2nd one, when I fetch the first page in scrapy shell:
(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311')
2013-09-19 02:05:38+0200 [rmpspider] DEBUG: Crawled (200) <GET http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311> (referer: None)
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]
But when I run the shell starting directly from page 2, the next page is fine, and then the next link from page 3 is wrong again:
(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2"
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311')
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&pageNo=4&sid=2311', text=u'c', fragment='', nofollow=False)]
In the meantime, you can write an equivalent spider with BaseSpider and build the next-page Requests "manually", using a small HtmlXPathSelector select() and urlparse.urljoin():
#from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.spider import BaseSpider
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem
import urlparse

class MySpider(BaseSpider):
    name = "rmpspider"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]
    #rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            yield prof

        # follow the "next" link, joining the relative href against the
        # current page URL instead of letting a link extractor rewrite it
        for url in hxs.select('//a[@id="next"]/@href').extract():
            yield Request(urlparse.urljoin(response.url, url))
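To try the spider, run the crawl from the project directory and export the items to a feed; a sketch, assuming a Scrapy version of that era (newer releases infer the format from the file extension, so -t may be unnecessary):

$ scrapy crawl rmpspider -o profs.json -t json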