I am scraping from a list of URLs in a text file. On the site I'm scraping there are many cases where a URL from the text file redirects to another URL. I would like to record both the original URL from the text file and the URL it was redirected to.
My spider code is as follows:
import datetime
import urlparse
import socket

import scrapy
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader

from ..items import TermsItem


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

    def parse(self, response):
        # Create the loader using the response
        l = ItemLoader(item=TermsItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//h1[@class="foo"]/span/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('detail', '//*[@class="bar"]//text()',
                    MapCompose(unicode.strip))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()
My items.py is as follows:
from scrapy.item import Item, Field


class TermsItem(Item):
    # Primary fields
    title = Field()
    detail = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
Do I need to make a callback that somehow ties back to the i.strip() from

start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

and then add a field to items.py to load it into as another # Housekeeping field?
I initially tested replacing

l.add_value('url', response.url)

with

l.add_value('url', response.request.url)

but this produced the same result.
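My understanding (which may well be wrong) is that this is because, by the time parse() runs, the default RedirectMiddleware has already followed the redirect, so response.request points at the final URL. The middleware does seem to record the earlier URLs in the request meta, so something along these lines might work, although I have not verified it and original_url would be an extra field I'd have to add:

    # Untested: RedirectMiddleware is supposed to store the URLs it followed
    # in request.meta['redirect_urls']; the first entry should be the
    # original URL from todo.urls.txt.
    original_url = response.meta.get('redirect_urls', [response.url])[0]
    l.add_value('original_url', original_url)  # would need original_url = Field() in items.py
    l.add_value('url', response.url)           # final (redirected) URL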
Any help is greatly appreciated. Regards,
Answer 0 (score: 1)
You need to use the handle_httpstatus_list attribute in your spider. Consider the following example:
from scrapy import Spider


class First(Spider):
    name = "redirect"
    handle_httpstatus_list = [301, 302, 304, 307]
    start_urls = ["http://www.google.com"]

    def parse(self, response):
        if 300 < response.status < 400:
            redirect_to = response.headers['Location'].decode("utf-8")
            print(response.url + " is being redirected to " + redirect_to)
            # if we need to process this new location we need to yield it ourselves
            yield response.follow(redirect_to)
        else:
            print(response.url)
The output of this is:
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.google.com> (referer: None)
http://www.google.com is being redirected to http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw> (referer: None)
http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw is being redirected to https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl> (referer: http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw)
https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl
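If you would rather keep the ItemLoader-based parse() from the question and still capture both URLs, here is a rough, untested sketch of how the two could be combined. original_url is a hypothetical extra Field() that would have to be added to TermsItem, and the redirect statuses are handled manually as in the example above:

    class BasicSpider(scrapy.Spider):
        name = "basic"
        handle_httpstatus_list = [301, 302, 303, 307]
        start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

        def parse(self, response):
            if 300 < response.status < 400:
                redirect_to = response.headers['Location'].decode("utf-8")
                # Carry the original URL along to the request for the new location
                yield response.follow(
                    redirect_to,
                    meta={'original_url': response.meta.get('original_url', response.url)})
                return
            l = ItemLoader(item=TermsItem(), response=response)
            l.add_xpath('title', '//h1[@class="foo"]/span/text()',
                        MapCompose(unicode.strip, unicode.title))
            l.add_xpath('detail', '//*[@class="bar"]//text()',
                        MapCompose(unicode.strip))
            # Housekeeping fields
            l.add_value('original_url', response.meta.get('original_url', response.url))
            l.add_value('url', response.url)  # final URL after following the redirect(s)
            yield l.load_item()

Relative Location headers are also handled here, since response.follow() resolves them against response.url.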