My goal is to scrape a list of URLs and titles from a website as part of a larger project - this is what drove me to learn Scrapy. Now, as it stands, using BaseSpider to scrape the first page for a given date (formatted /archive/date/) works fine. However, trying to use CrawlSpider (working through some tutorials) to scrape every sequential page for a given date does not work, and I can't figure out why. I have tried a number of solutions.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    allowed_domains = "phys.org"
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        Rule(SgmlLinkExtractor(allow=("\d\.html",)),
             callback="parse_items", follow=True),
    )

    #def parse_start_url(self, response):
    #    request = Request(start_urls, callback = self.parse_items)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//article[@class='news-box news-detail-box  clearfix']/h4")
        items = []
        for title in titles:
            item = PhysurlfetchItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
For now I have parse_start_url commented out, because my attempt to kick things off that way with start_urls (using the varying string) failed. Running this currently jumps straight to page 2 for the given date without scraping any data from page 1, and then stops (no page 2 data, no page 3).
Answer 0 (score: 1)
When I run your spider locally (with scrapy runspider yourspider.py) I get this console output:
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Filtered offsite request to 'phys.org': <GET http://phys.org/archive/5-12-2013/page2.html>
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)
You can see Scrapy filtering one request as offsite. In fact, allowed_domains should be a list of domains, so if you change it to allowed_domains = ["phys.org"] you get further:
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page2.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Filtered duplicate request: <GET http://phys.org/archive/5-12-2013/page3.html> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page8.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page6.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Redirecting (301) to <GET http://phys.org/archive/5-12-2013/> from <GET http://phys.org/archive/5-12-2013/page1.html>
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page4.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page7.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page5.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page3.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: http://phys.org/archive/5-12-2013/page2.html)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)
But the spider is not picking up any items. It may or may not be a typo, but the XPath expression for titles should be //article[@class='news-box news-detail-box clearfix']/h4, i.e. without the extra space before clearfix.
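As a side note (an editor's suggestion, not part of the original answer): matching @class with strict equality is fragile, since it breaks on any extra whitespace or a different class order. XPath's contains() tests a substring of the attribute, so it tolerates exactly this kind of stray space. A minimal sketch of the selection line inside parse_items, using the same old-style HtmlXPathSelector as the question:

hxs = HtmlXPathSelector(response)
# contains() tolerates extra whitespace and class reordering in @class;
# it is a plain substring test, so pick a sufficiently unique class token
titles = hxs.select("//article[contains(@class, 'news-detail-box')]/h4")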
Finally, note that if you use a recent Scrapy version (0.20.0 onwards) you can use CSS selectors, which can be more readable than XPath when selecting elements with multiple classes:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    # allowed_domains must be a list, not a bare string
    allowed_domains = ["phys.org"]
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        # follow the pageN.html pagination links and parse each one
        Rule(SgmlLinkExtractor(allow=("\d\.html",)),
             callback="parse_items", follow=True),
    )

    #def parse_start_url(self, response):
    #    request = Request(start_urls, callback = self.parse_items)

    def parse_items(self, response):
        selector = Selector(response)
        # selecting only using "news-detail-box" class;
        # you could use "article.news-box.news-detail-box.clearfix > h4"
        titles = selector.css("article.news-detail-box > h4")
        items = []
        for title in titles:
            item = PhysurlfetchItem()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            items.append(item)
        self.log("%d items in %s" % (len(items), response.url))
        return items
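One thing the snippets above take for granted is the PhysurlfetchItem class imported from physurlfetch.items, which is never shown. A minimal sketch of what that items.py presumably looks like, given the title and link fields the spider fills in (the class and field names come from the code above; the rest is an assumption):

from scrapy.item import Item, Field

class PhysurlfetchItem(Item):
    # the two fields populated in parse_items
    title = Field()
    link = Field()

With that in place you can run the spider as before, e.g. scrapy runspider yourspider.py -o items.json to dump the scraped titles and links to a JSON feed.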