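I regularly use scrapy to check whether a long list of links is reachable.
My problem is that the crawler crashes when a link is malformed (for example, when it does not start with http:// or https://):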
ValueError: Missing scheme in request url: http.www.gobiernoenlinea.gob.ve/noticias/viewNewsUser01.jsp?applet=1&id_noticia=41492
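I read the list of links from a pandas Series and check each one. When a response comes back, I record the link as "ok"; otherwise I record it as "dead".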
import scrapy
import pandas as pd
from link_checker.items import LinkCheckerItem


class Checker(scrapy.Spider):
    name = "link_checker"

    def get_links(self):
        # Load the "Value" column of the CSV, which holds the links to check
        df = pd.read_csv(r"final_07Sep2018.csv")
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        # Series.items() yields (index, value) pairs; iteritems() was removed in pandas 2.0
        for url in urls.items():
            index = {"index": url[0]}
            yield scrapy.Request(url=url[1], callback=self.get_response, errback=self.errback_httpbin, meta=index, dont_filter=True)

    def get_response(self, response):
        # Reachable link: record it as "ok"
        url = response.url
        yield LinkCheckerItem(index=response.meta["index"], url=url, code="ok")

    def errback_httpbin(self, failure):
        # Failed request: record it as "dead"
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="dead")
I would still like to catch those malformed URLs. How can I validate them and yield "dead" for them?
Answer 0 (score: 0)
You can simply check whether the URL starts with http or https. If it doesn't, prepend http manually.
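A minimal sketch of that check, using only the standard library. The helper name normalize_url and its default_scheme parameter are hypothetical and not part of Scrapy, and treating a URL without a host name as malformed is an assumption about what "malformed" should mean here.

```python
from urllib.parse import urlsplit


def normalize_url(raw, default_scheme="http"):
    """Return raw with an http(s) scheme, or None if it looks malformed.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    if not isinstance(raw, str):
        return None  # e.g. NaN read from an empty CSV cell
    candidate = raw.strip()
    if not candidate:
        return None
    if not candidate.lower().startswith(("http://", "https://")):
        # No recognised scheme: prepend the default one, as suggested above.
        candidate = f"{default_scheme}://{candidate}"
    try:
        host = urlsplit(candidate).hostname
    except ValueError:
        return None  # urlsplit rejects e.g. unbalanced IPv6 brackets
    if not host:
        return None  # assumption: a URL without a host counts as malformed
    return candidate
```

In start_requests, the value could be passed through normalize_url before building the scrapy.Request. A link such as the one in the traceback then becomes http://http.www.gobiernoenlinea.gob.ve/..., which Scrapy accepts as a request URL; since that host is unlikely to resolve, the request should end up in errback_httpbin and be recorded as "dead" (with the prefixed URL rather than the original string). Rows for which the helper returns None never reach the Request constructor, so they would need to be recorded as "dead" separately.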