How can I make Scrapy handle invalid links?

Time: 2018-09-07 15:52:07

Tags: python python-3.x scrapy

I often use Scrapy to check whether a long list of links is reachable.

My problem is that the crawler crashes when a link is malformed (for example, when it does not start with http:// or https://).

ValueError: Missing scheme in request url: http.www.gobiernoenlinea.gob.ve/noticias/viewNewsUser01.jsp?applet=1&id_noticia=41492

I read the list of links from a pandas Series and check each one. When a link responds, I record it as "ok"; otherwise I record it as "dead".

import scrapy
import pandas as pd
from link_checker.items import LinkCheckerItem



class Checker(scrapy.Spider):
    name = "link_checker"


    def get_links(self):
        df = pd.read_csv(r"final_07Sep2018.csv")
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        for url in urls.items():  # Series.iteritems() was removed in pandas 2.0
            index = {"index" : url[0]}
            yield scrapy.Request(url=url[1], callback=self.get_response, errback=self.errback_httpbin, meta=index, dont_filter=True)

    def get_response(self, response):
        url = response.url

        yield LinkCheckerItem(index=response.meta["index"], url=url, code="ok")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="dead")

I am still interested in catching those malformed URLs. How can I validate them and yield "dead" for them?

1 Answer:

Answer 0: (score: 0)

You just need to check whether the URL starts with http or https.

If it does not, add http manually.
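The check described above could be sketched as follows. This is only a minimal illustration, not the answerer's code: the helper names `is_well_formed` and `add_default_scheme` are hypothetical, and it uses the standard library's `urllib.parse.urlparse` to test for an actual `http://` or `https://` scheme instead of a plain string prefix, because the URL in the question begins with `http.` yet still has no scheme.

```python
from urllib.parse import urlparse


def is_well_formed(url):
    """Return True if url has an http/https scheme and a host part."""
    parts = urlparse(str(url).strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def add_default_scheme(url, scheme="http"):
    """Prefix url with a scheme when it does not already have a valid one.

    Hypothetical helper: it assumes the only defect is a missing scheme.
    """
    url = str(url).strip()
    if is_well_formed(url):
        return url
    return f"{scheme}://{url}"


if __name__ == "__main__":
    bad = ("http.www.gobiernoenlinea.gob.ve/noticias/"
           "viewNewsUser01.jsp?applet=1&id_noticia=41492")
    print(is_well_formed(bad))        # no "://" in the string, so no scheme
    print(add_default_scheme(bad))    # now accepted by scrapy.Request
```

In the spider from the question, one option would be to call `add_default_scheme(url[1])` before building the `scrapy.Request` in `start_requests`. The request then no longer raises `ValueError`, and if the repaired URL still cannot be reached, the existing `errback_httpbin` should record it as "dead". Alternatively, `is_well_formed` could be used to set malformed rows aside before any request is made.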