How do I convert relative paths to absolute paths with a Scrapy CrawlSpider?

Asked: 2017-11-11 14:53:26

Tags: python scrapy web-crawler

I am new to Scrapy and I am trying to write a CrawlSpider that crawls a forum on the Tor darknet. My current CrawlSpider code is:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['http://answerstedhctbek.onion/questions']
    allowed_domains = ['http://answerstedhctbek.onion', 'answerstedhctbek.onion']
    rules = (
            Rule(LinkExtractor(allow=(r'answerstedhctbek.onion/\d\.\*', r'https://answerstedhctbek.onion/\d\.\*')), follow=True, process_links='makeAbsolutePath'),
            Rule(LinkExtractor(allow=()), follow=True, process_links='makeAbsolutePath')

            )

 def makeAbsolutePath(links):
    for i in range(links):
          links[i] = links[i].replace("../","")
    return links

Because the forum uses relative paths, I tried to write a custom process_links that strips the "../". However, when I run my code I still get:

2017-11-11 14:46:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../badges>: HTTP status code is not handled or not allowed
2017-11-11 14:46:46 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../general-guidelines> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../general-guidelines>: HTTP status code is not handled or not allowed
2017-11-11 14:46:47 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../contact-us> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../contact-us>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=hot> (referer: http://answerstedhctbek.onion/questions)
2017-11-11 14:46:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://answerstedhctbek.onion/../questions?sort=hot>: HTTP status code is not handled or not allowed
2017-11-11 14:46:48 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://answerstedhctbek.onion/../questions?sort=votes> (referer: http://answerstedhctbek.onion/questions)

As you can see, I still get 400 errors because of the bad paths. Why isn't my code removing the "../" from the links?

Thanks!

1 Answer:

Answer 0 (score: 0)

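The problem is probably that makeAbsolutePath is not a member of the spider class. The documentation states:

"process_links is a callable, or a string (in which case a method from the spider object with that name will be used)"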

You did not use self in makeAbsolutePath, so I assume this is not just an indentation error. makeAbsolutePath has a few other bugs as well. If we correct the code to this state:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HiddenAnswersSpider(CrawlSpider):
    name = 'ha'
    start_urls = ['file:///home/user/testscrapy/test.html']
    allowed_domains = []
    rules = (
            Rule(LinkExtractor(allow=(r'.*')), follow=True, process_links='makeAbsolutePath'),
            )

    def makeAbsolutePath(self, links):
        print(links)
        for i in range(links):
            links[i] = links[i].replace("../","")
        return links

it produces this error:

TypeError: 'list' object cannot be interpreted as an integer

This is because len() was not used in the call to range, and range only works on integers. It expects a number and gives you the range from 0 to that number minus 1.
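A minimal sketch of the difference, using a plain list of strings as a stand-in for the real links (the values are made up for illustration):

```python
# Stand-in data for illustration only; in Scrapy the list holds Link objects.
links = ["../badges", "../contact-us"]

# range() needs an integer, so passing the list itself raises TypeError.
try:
    range(links)
    range_error = None
except TypeError as exc:
    range_error = str(exc)

# Passing len(links) yields the indices 0 .. len(links) - 1.
indices = list(range(len(links)))

# enumerate() gives index and value together, which avoids range(len(...)).
cleaned = list(links)
for i, href in enumerate(cleaned):
    cleaned[i] = href.replace("../", "")

print(range_error)
print(indices)
print(cleaned)
```

Iterating over the list directly, or with enumerate(), sidesteps the range() call altogether.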

After fixing this, the next error is:

AttributeError: 'Link' object has no attribute 'replace'

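This is because, contrary to what you assumed, links is not a list of strings containing the contents of the href="" attributes. Instead, it is a list of Link objects.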
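To illustrate the distinction without needing Scrapy installed, here is a rough sketch. StubLink is a hypothetical stand-in for Scrapy's Link class and models only its url attribute:

```python
from dataclasses import dataclass


@dataclass
class StubLink:
    """Hypothetical stand-in for scrapy.link.Link; only url is modelled."""
    url: str


link = StubLink(url="http://answerstedhctbek.onion/../badges")

# Calling a string method on the object itself fails, as in the traceback.
try:
    link.replace("../", "")
    error = None
except AttributeError as exc:
    error = str(exc)

# The string lives in the url attribute, so string methods work there.
fixed_url = link.url.replace("../", "")

print(error)
print(fixed_url)
```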

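I suggest printing the contents of links inside makeAbsolutePath to see whether you need to do anything at all. In my opinion, Scrapy should already stop resolving .. operators once it reaches the domain level, so your links should point to http://answerstedhctbek.onion/<number>/<title>, even though the site uses the .. operator without an actual folder level (because the URL is /questions and not /questions/).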
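The resolution behaviour described above can be checked with the standard library alone. This sketch assumes Python 3.5 or later, where urljoin follows RFC 3986 and drops ".." segments that would climb above the root. It does not show what Scrapy itself does, which may depend on the Scrapy and Python versions in use. The hrefs are taken from the log output in the question.

```python
from urllib.parse import urljoin

# The page the links were extracted from (no trailing slash).
base = "http://answerstedhctbek.onion/questions"

# Relative hrefs as they appear to be written on that page.
hrefs = ["../badges", "../questions?sort=hot"]

# Resolve each href against the base URL.
resolved = [urljoin(base, href) for href in hrefs]

for url in resolved:
    print(url)
```

If the printed URLs contain no "..", the standard resolution is fine and any remaining "/../" would be coming from elsewhere in the crawl setup.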

Something like this:

    def makeAbsolutePath(self, links):
        for i in range(len(links)):
            print(links[i].url)

        return []

(Returning an empty list here has the advantage that the spider stops, so you can inspect the console output.)

If you find that the URLs are actually wrong, you can process them through the url attribute:

links[i].url = 'http://example.com'
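Putting the pieces together, a corrected link processor might look roughly like the sketch below. It is written as a plain function so it runs without Scrapy. In the spider it would be a method taking self as its first parameter. StubLink and make_absolute_path are hypothetical names used only for this illustration, and the "../" replacement simply mirrors what the question attempted.

```python
from dataclasses import dataclass


@dataclass
class StubLink:
    """Hypothetical stand-in for scrapy.link.Link; only url is modelled."""
    url: str


def make_absolute_path(links):
    """Strip "../" from every link's url and return the same list."""
    for link in links:
        # Rewrite the url attribute, not the Link object itself.
        link.url = link.url.replace("../", "")
    return links


links = [
    StubLink("http://answerstedhctbek.onion/../badges"),
    StubLink("http://answerstedhctbek.onion/../questions?sort=hot"),
]
result = [link.url for link in make_absolute_path(links)]

for url in result:
    print(url)
```

Returning the list matters: the rule continues with whatever the processor returns, which is also why returning an empty list stops the crawl.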