Scrapy: recording the original URL from a text file when the request is redirected

Date: 2017-09-21 05:04:40

Tags: python-2.7 scrapy

I am scraping from a list of URLs in a text file. The site I am scraping has many cases where a URL from the text file redirects to another URL. I want to be able to record both the original URL from the text file and the URL it redirects to.

My spider code is as follows:

import datetime
import urlparse
import socket
import scrapy

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader

from ..items import TermsItem


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

    def parse(self, response):
        # Create the loader using the response
        l = ItemLoader(item=TermsItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//h1[@class="foo"]/span/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('detail', '//*[@class="bar"]//text()',
                    MapCompose(unicode.strip))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item() 

My items.py is as follows:

from scrapy.item import Item, Field


class TermsItem(Item):
    # Primary fields
    title = Field()
    detail= Field()
    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()

Do I need a callback that somehow ties each request back to the i.strip() from

start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

and then add a field in items.py to load it alongside the # Housekeeping fields?
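For reference, one common pattern for this is to override start_requests and stash the original URL in the request's meta dict, which Scrapy copies onto the response (and, via its redirect middleware, across redirects). The sketch below simulates only the meta bookkeeping with plain dictionaries so it runs without Scrapy installed; the `start_url` meta key and the file name are illustrative assumptions, not Scrapy API:

```python
# Sketch of the meta-passing idea, using plain dicts in place of
# scrapy.Request/Response so it runs without Scrapy installed.

def build_start_requests(lines):
    """Mimic start_requests(): one request per stripped, non-empty line,
    remembering the original URL under a 'start_url' meta key."""
    requests = []
    for line in lines:
        url = line.strip()
        if url:
            requests.append({"url": url, "meta": {"start_url": url}})
    return requests


def load_housekeeping(response_url, meta):
    """Mimic the housekeeping part of parse(): record both the final URL
    and the original one carried through meta (falling back to the
    final URL when no start_url was attached)."""
    return {
        "url": response_url,
        "start_url": meta.get("start_url", response_url),
    }
```

In a real spider the loop body would yield scrapy.Request(url, meta={'start_url': url}), and parse() would call l.add_value('start_url', response.meta['start_url']) after adding a matching start_url = Field() to TermsItem.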

I initially tested replacing:

l.add_value('url', response.url)

l.add_value('url', response.request.url)

but this produced the same result.
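That result is expected: Scrapy's RedirectMiddleware replaces the original request with a new one for the redirect target, so by the time parse() runs, response.url and response.request.url both point at the final URL. The middleware keeps the earlier URLs of the chain under the redirect_urls meta key (oldest first). A small helper can recover the original URL from that; it is sketched here over a plain dict so it runs without Scrapy:

```python
# RedirectMiddleware stores already-followed URLs in meta['redirect_urls']
# (oldest first), which is why response.url == response.request.url in
# parse(). This helper recovers the original URL from that chain.

def original_url(meta, final_url):
    """Return the first URL of the redirect chain, or the final URL
    if no redirect happened."""
    chain = meta.get("redirect_urls")
    return chain[0] if chain else final_url
```

In the spider this could be used as l.add_value('url', original_url(response.meta, response.url)), which records the URL as it appeared in the text file even after one or more redirects.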

Any help is greatly appreciated. Regards,

1 Answer:

Answer 0 (score: 1)

You need to use the handle_httpstatus_list attribute in your spider. Consider the following example:

from scrapy import Spider


class First(Spider):
    name = "redirect"
    handle_httpstatus_list = [301, 302, 304, 307]
    start_urls = ["http://www.google.com"]

    def parse(self, response):
        if 300 < response.status < 400:
            redirect_to = response.headers['Location'].decode("utf-8")
            print(response.url + " is being redirected to " + redirect_to)

            # if we need to process this new location we need to yield it ourself
            yield response.follow(redirect_to)
        else:
            print(response.url)

The output of the above is:

2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.google.com> (referer: None)
http://www.google.com is being redirected to http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw> (referer: None)
http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw is being redirected to https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl> (referer: http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw)
https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl
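One caveat with handling redirects manually like this: the Location header is allowed to be a relative URL, so it is safer to resolve it against the current response URL before following it. A minimal sketch using only the standard library (the try/except import keeps it working on both Python 2.7 and 3):

```python
# Resolve a possibly-relative Location header against the URL that
# produced the redirect, the way browsers and RedirectMiddleware do.
try:
    from urlparse import urljoin          # Python 2.7
except ImportError:
    from urllib.parse import urljoin      # Python 3


def resolve_redirect(current_url, location):
    """Return the absolute URL a Location header points at."""
    return urljoin(current_url, location)
```

In the answer's parse() method this would replace the raw header read, e.g. redirect_to = resolve_redirect(response.url, response.headers['Location'].decode("utf-8")).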