College is starting for me soon, so I decided to build a web scraper for Rate My Professors to help me find the highest-rated teachers at my school. The scraper works great... but only up to the second page! No matter what I try, I can't get it to go any further.
Here is the URL I'm trying to scrape: http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=3 (not my actual university, but the URL structure is the same)
Here is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem

class MySpider(CrawlSpider):
    name = "rmp"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]
    rules = (Rule(SgmlLinkExtractor(allow=(r'&pageNo=\d',),
                                    restrict_xpaths=('//a[@id="next"]',)),
                  callback='parser', follow=True),)

    def parser(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        profs = []
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            prof["dept"] = line.select("div[@class='profDept']/text()").extract()
            prof["ratings"] = line.select("div[@class='profRatings']/text()").extract()
            prof["avg"] = line.select("div[@class='profAvg']/text()").extract()
            profs.append(prof)
        return profs  # hand the collected items back to Scrapy
Some things I've tried include removing the restrict_xpaths keyword argument (which caused the scraper to follow the first, last, next, and back buttons, since they all share the &pageNo=\d URL structure) and changing the regular expression in the allow keyword argument (which changed nothing).
Does anyone have any suggestions? It seems like a simple problem, but I've already spent an hour and a half trying to solve it! Any help would be greatly appreciated.
Answer 0 (score: 3)
The site doesn't handle the page parameters well when they arrive out of the expected order. Look at the href values:
$ curl -q -s "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2" |grep \"next\"
<a href="/SelectTeacher.jsp?sid=2311&pageNo=3" id="next">c</a>
$ curl -q -s "http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311" |grep \"next\"
<a href="/SelectTeacher.jsp?pageNo=2&sid=2311&pageNo=3" id="next">c</a>
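That reordering comes from the default URL canonicalization that Scrapy's link extractors apply, which among other things sorts the query parameters alphabetically (pageNo before sid). A minimal sketch of the effect, assuming w3lib (the URL helper library that ships with Scrapy) is available:

from w3lib.url import canonicalize_url

# canonicalize_url() sorts the query string, which is exactly the
# reordering seen in the curl output above
print canonicalize_url('http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2')
# http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311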
To avoid modifying the original URL, you should pass the parameter canonicalize=False to the SgmlLinkExtractor class. Also, you might want to use a less specific XPath rule, because with the current rule you can't grab the items from the starting URL.
Like this:
rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="pagination"]',
                           canonicalize=False),
         callback='parser', follow=True),
]
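Before running the full crawl, you can sanity-check the extractor in scrapy shell; a sketch of such a session (the Link output below is illustrative, not captured from the live site):

$ scrapy shell "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths='//div[@id="pagination"]',
...                   canonicalize=False).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2', ...), ...]

Pointing restrict_xpaths at the whole pagination div means the first/back/next/last links are all extracted, but Scrapy's built-in duplicate filter keeps already-seen pages from being requested twice.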
Answer 1 (score: 0)
I posted on the Scrapy Google Groups page and received an answer! Here it is:
I think you may have found a bug. SgmlLinkExtractor has a problem with the page after the 2nd one, when I fetch the first page in scrapy shell:
(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311')
2013-09-19 02:05:38+0200 [rmpspider] DEBUG: Crawled (200) <GET http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&sid=2311> (referer: None)
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=2&pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]
But when I run the shell starting directly from page 2, the next page is fine, and then the next link from page 3 is wrong again:
(py2.7)paul@wheezy:~/tmp/rmp$ scrapy shell "http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=2"
...
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311', text=u'c', fragment='', nofollow=False)]
>>> fetch('http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&sid=2311')
...
>>> SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)).extract_links(response)
[Link(url='http://www.ratemyprofessors.com/SelectTeacher.jsp?pageNo=3&pageNo=4&sid=2311', text=u'c', fragment='', nofollow=False)]
In the meantime, you can write an equivalent spider with BaseSpider and build the next-page Requests "manually", using a small HtmlXPathSelector select() and urlparse.urljoin():
#from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.spider import BaseSpider
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem
import urlparse

class MySpider(BaseSpider):
    name = "rmpspider"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]
    #rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="next"]',)), callback='parser', follow=True),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        html = hxs.select("//div[@class='entry odd vertical-center'] | //div[@class='entry even vertical-center']")
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            yield prof

        # follow the "next" link, joining the relative href against the
        # current page URL instead of letting a link extractor rewrite it
        for url in hxs.select('//a[@id="next"]/@href').extract():
            yield Request(urlparse.urljoin(response.url, url))
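To try the spider, run the crawl from the project directory and export the items to a feed; a sketch, assuming a Scrapy version of that era (newer releases infer the format from the file extension, so -t may be unnecessary):

$ scrapy crawl rmpspider -o profs.json -t json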