I am trying to scrape a website with Scrapy, and the URL of every page I want to scrape is written with a relative path like this:
<!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) -->
<a href="../../en/item-to-scrap.html">Link</a>
Now, in my browser these links work: you end up at a URL like https://www.domain-name.com/en/item-to-scrap.html (even though the relative path goes up two levels in the hierarchy instead of one).
But my CrawlSpider does not manage to translate these into "correct" URLs, and all I get is this error:
2013-10-13 09:30:41-0500 [domain-name.com] DEBUG: Retrying <GET https://www.domain-name.com/../en/item-to-scrap.html> (failed 1 times): 400 Bad Request
Is there a way to fix this, or am I missing something?
Here is my spider code, which is fairly basic (based on item URLs matching "/en/item-*-scrap.html"):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product
Answer 0 (score: 2)
Basically, deep down, scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin to get the next URL by joining the current URL and the link URL. If you join the URLs you mentioned,
<!-- on page https://www.domain-name.com/en/somelist.html -->
<a href="../../en/item-to-scrap.html">Link</a>
the returned URL is the same as the one in the scrapy error. Try it in the Python shell:
import urlparse
urlparse.urljoin("https://www.domain-name.com/en/somelist.html","../../en/item-to-scrap.html")
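On the Python 2 urlparse this join is expected to keep the leftover parent segment, matching the URL in the error log above (a hedged note: newer Python versions that apply RFC 3986 dot-segment removal collapse the extra '..' instead):

# -> 'https://www.domain-name.com/../en/item-to-scrap.html'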
The urljoin behaviour seems to be valid. See: http://tools.ietf.org/html/rfc1808.html#section-5.2
If possible, could you share the site you are crawling?
With this understanding, the possible solutions are:
1) Manipulate the URLs (strip the two dots and the slash) inside the crawl spider itself, basically by overriding parse or _requests_to_follow (a sketch follows the source link below).
Source of the crawl spider: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py
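For illustration only, a minimal sketch of option 1 (the class name FixedUrlSpider is made up, and it assumes _requests_to_follow keeps the (self, response) signature from the linked source; request.replace() builds a copy of the request with the cleaned URL):

from scrapy.contrib.spiders import CrawlSpider

class FixedUrlSpider(CrawlSpider):
    # name, allowed_domains, start_urls and rules as in the question ...

    def _requests_to_follow(self, response):
        # Let CrawlSpider build the follow-up requests, then rewrite each
        # URL to drop the '../' segments that urljoin left in place.
        for request in super(FixedUrlSpider, self)._requests_to_follow(response):
            yield request.replace(url=request.url.replace("../", ""))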
2) Manipulate the URL in a downloader middleware, which may be cleaner: remove the ../ in the middleware's process_request (see the sketch after the documentation link).
Documentation of downloader middlewares: http://scrapy.readthedocs.org/en/0.16/topics/downloader-middleware.html
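A rough sketch of such a middleware (the class name is hypothetical; it would still need to be enabled under DOWNLOADER_MIDDLEWARES in settings.py):

class CleanRelativeUrlMiddleware(object):
    """Strip stray '../' segments before the request is sent."""

    def process_request(self, request, spider):
        if "../" in request.url:
            # Returning a new Request makes Scrapy schedule it in place
            # of the original one; the cleaned request then passes this
            # check and continues normally.
            return request.replace(url=request.url.replace("../", ""))
        return None  # let the original request continue unchanged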
3) Use a BaseSpider and return requests with the manipulated URLs that you want to crawl further (again, a sketch follows the link below).
Documentation of BaseSpider: http://scrapy.readthedocs.org/en/0.16/topics/spiders.html#basespider
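And a sketch of option 3 with a plain BaseSpider (names are illustrative; it resolves every link itself and strips the '../' before yielding the request):

import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class SiteBaseSpider(BaseSpider):
    name = "domain-name.com-base"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # Resolve the relative link and drop the leftover '../'.
            url = urlparse.urljoin(response.url, href).replace("../", "")
            yield Request(url, callback=self.parse)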
Let me know if you have any questions.
Answer 1 (score: 1)
I finally found a solution thanks to this answer. I used process_links as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class Product(Item):
    name = Field()

class siteSpider(CrawlSpider):
    name = "domain-name.com"
    allowed_domains = ['www.domain-name.com']
    start_urls = ["https://www.domain-name.com/en/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\/en\/item\-[a-z0-9\-]+\-scrap\.html')), process_links='process_links', callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=('')), process_links='process_links', follow=True),
    )

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        product = Product()
        product['name'] = ''
        name = x.select('//title/text()').extract()
        if type(name) is list:
            for s in name:
                if s != ' ' and s != '':
                    product['name'] = s
                    break
        return product

    def process_links(self, links):
        # Rewrite each extracted link to drop the stray '../' segments
        # before the request is scheduled.
        for i, w in enumerate(links):
            w.url = w.url.replace("../", "")
            links[i] = w
        return links