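I am using Scrapy to fetch a list of URLs. Some of the URLs are redirected to another one (HTTP 302). What I want is to count the number of redirects that occur for a single URL, and to collect the full set of intermediate redirect URLs. For example:
Fetched - http://ign.com
Redirected to - http://de.ign.com/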
redirect_count = 1
url_set = ['http://ign.com','http://de.ign.com/']
Answer 0 (score: 1)
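What you need is to handle the 302 HTTP status in your spider, so that the redirect response reaches your callback instead of being followed automatically: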
handle_httpstatus_list = [200, 302, 404] # any other if you want
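Once 302 is in that list, the redirect response itself arrives in your callback, and the target URL sits in its Location header, which may be relative. As a minimal sketch using only the standard library (the helper name `resolve_redirect` is made up for illustration and is not part of Scrapy), resolving that header against the request URL could look like this:

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Return the absolute URL that a Location header value points to.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    # urljoin leaves absolute targets untouched and resolves relative ones
    # against the URL that was originally requested.
    return urljoin(request_url, location)


print(resolve_redirect("http://ign.com", "http://de.ign.com/"))
print(resolve_redirect("http://ign.com/a/b", "/de/"))
```

Inside a Scrapy callback, `response.urljoin(...)` does the same resolution against `response.url`.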
Here is an example. Define your items.py as:
    from scrapy.item import Item, Field


    class myItems(Item):
        redirect_count = Field()
Later, in spider.py:
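    from scrapy import Request, Spider
    from .items import myItems


    class mainSpider(Spider):
        name = "crazyCrawler"
        # allowed_domains takes bare domain names, not URLs
        allowed_domains = ['ign.com']
        # let 302 responses reach parse() instead of being followed for us
        handle_httpstatus_list = [200, 302, 404]  # any other if you want
        start_urls = [
            "http://ign.com"
        ]

        def parse(self, response):
            # redirects followed so far for this start URL
            count = response.meta.get('redirect_count', 0)
            if response.status == 302 and 'Location' in response.headers:
                # follow the redirect by hand, carrying the running count along
                target = response.urljoin(response.headers['Location'].decode())
                yield Request(target, callback=self.parse,
                              meta={'redirect_count': count + 1})
            else:
                item = myItems()
                item['redirect_count'] = count
                yield item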
Now you can run it:
scrapy crawl crazyCrawler -o items.json
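For the url_set part of the question, another option may be to leave 302 out of handle_httpstatus_list and let Scrapy's built-in RedirectMiddleware follow the redirects. It records the URLs it passed through in `response.meta['redirect_urls']` and the hop count in `response.meta['redirect_times']`. The sketch below assumes that variant; `redirect_summary` is a hypothetical helper name, and it takes a plain dict so it can be tried without running a crawl:

```python
def redirect_summary(final_url, meta):
    """Build the redirect count and URL chain for one response.

    `meta` is expected to look like Scrapy's `response.meta` after the
    built-in RedirectMiddleware has followed redirects: 'redirect_urls'
    holds the URLs that redirected, 'redirect_times' the number of hops.
    Hypothetical helper for illustration; not a Scrapy API.
    """
    visited = list(meta.get("redirect_urls", []))
    return {
        # fall back to the length of the chain if the counter is absent
        "redirect_count": meta.get("redirect_times", len(visited)),
        # every URL that redirected, followed by the URL finally served
        "url_set": visited + [final_url],
    }


# Assumed usage inside a spider callback (302 must NOT be listed in
# handle_httpstatus_list here, so the middleware follows the redirects):
#
#     def parse(self, response):
#         yield redirect_summary(response.url, response.meta)

example = redirect_summary(
    "http://de.ign.com/",
    {"redirect_urls": ["http://ign.com"], "redirect_times": 1},
)
print(example)
```

A URL that was never redirected has neither key in its meta, so the helper reports a count of 0 and a url_set containing only that URL.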