Why does my Scrapy scraper only return the second page of results?

Asked: 2013-09-18 00:39:22

Tags: python regex screen-scraping scrapy

College is about to start for me, so I decided to build a web scraper for Rate My Professor to help me find the highest-rated teachers at my school. The scraper works great... but only for the second page! No matter what I try, I can't get it to work properly.

Here is the URL I'm trying to scrape: http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=3 (not my actual college, but the URL structure is the same)

Here is my spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem

class MySpider(CrawlSpider):
    name = "rmp"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]

    rules = (Rule(SgmlLinkExtractor(allow=('&pageNo=\d',), restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)

    def parser(self, response):
        hxs = HtmlXPathSelector(response)
        # select both the odd and even entry rows
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        profs = []
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            prof["dept"] = line.select("div[@class='profDept']/text()").extract()
            prof["ratings"] = line.select("div[@class='profRatings']/text()").extract()
            prof["avg"] = line.select("div[@class='profAvg']/text()").extract()
            profs.append(prof)
        return profs
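
For completeness, RmpItem lives in rmp/items.py, which is not shown in the question; a minimal sketch of what it could look like, with the field names inferred from the spider above:

# rmp/items.py -- hypothetical minimal item definition, fields inferred from the spider
from scrapy.item import Item, Field

class RmpItem(Item):
    name = Field()     # professor name
    dept = Field()     # department
    ratings = Field()  # number of ratings
    avg = Field()      # average rating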

Some of the things I've tried include removing the restrict_xpaths keyword argument (which made the scraper follow the first, last, next, and back buttons, since they all share the &pageNo=\d URL structure) and changing the regex in the allow keyword argument (which changed nothing).

Does anyone have any suggestions? It seems like a simple problem, but I've already spent an hour and a half trying to solve it! Any help would be greatly appreciated.

2 answers:

Answer 0 (score: 3)

The site does not handle the page parameters well when they are not in the order it expects. Look at the href values:

$ curl -q -s  "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2"  |grep \"next\"
    <a href="/SelectTeacher.jsp?sid=2311&pageNo=3" id="next">c</a>
$ curl -q -s  "http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311"  |grep \"next\"
    <a href="/SelectTeacher.jsp?pageNo=2&sid=2311&pageNo=3" id="next">c</a>
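
The reordering happens because Scrapy's link extractors canonicalize extracted URLs by default, which among other things sorts the query arguments alphabetically (so pageNo ends up before sid). A quick sketch of the effect, assuming the w3lib package that ships with Scrapy:

# URL canonicalization sorts the query arguments, producing the order the site mishandles
from w3lib.url import canonicalize_url

url = "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=3"
print(canonicalize_url(url))
# http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311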

To avoid modifying the original URLs, you should pass the canonicalize=False argument to the SgmlLinkExtractor. Also, you may want to use a less specific xpath rule, because with the current rule you cannot get the items from the start URL.

Like this:

rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="pagination"]', 
                           canonicalize=False),
         callback='parser', follow=True),
]
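
Note that a CrawlSpider only invokes the rule callback on pages reached by following extracted links, so the start URL itself never goes through parser. If you also want the items from the first page, one option (a sketch, not part of the original answer) is to override parse_start_url inside the spider:

    # hypothetical addition to MySpider: reuse the same parsing logic for the start URL
    def parse_start_url(self, response):
        return self.parser(response)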

Answer 1 (score: 0)

I posted on the Scrapy Google Groups page and received an answer! Here it is:

I think you may have found a bug.

When I fetch the first page in the scrapy shell, the SgmlLinkExtractor goes wrong after the second page:

(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311', text=u'c', fragment='', nofollow=False)]

>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311')
2013-09-19 02:05:38+0200 [rmpspider] DEBUG: Crawled (200) <GET http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311> (referer: None)
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]

But when I run the shell starting directly from page 2, the next page is fine, while the next link from page 3 is wrong again:

(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2"
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]

>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311')
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&pageNo=4&sid=2311', text=u'c', fragment='', nofollow=False)]

In the meantime, you can write an equivalent spider with BaseSpider and build the next-page requests "manually", using a small HtmlXPathSelector select() and urlparse.urljoin():

#from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.spider import BaseSpider
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem
import urlparse

class MySpider(BaseSpider):
    name = "rmpspider"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]

    #rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            yield prof

        for url in hxs.select('//a[@id="next"]/@href').extract():
            yield Request(urlparse.urljoin(response.url, url))
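
To try it out, a standard Scrapy CLI run should work; the output file name here is just an example:

scrapy crawl rmpspider -o profs.json -t json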