Question

I am crawling the web with Scrapy (version 1.1.1). This is what I am facing:

import codecs

import scrapy


class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']
    # Read one start URL per line from link.txt
    with codecs.open('link.txt', 'r', 'utf-8') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        print(response.url)

In the code above, 'start_urls' is a list:

start_urls = [
    'http://example_0.com/?id=0',
    'http://example_0.com/?id=1',
    'http://example_0.com/?id=2',
]  # and so on

When Scrapy runs, the debug output tells me:

[scrapy] DEBUG: Redirecting (302) to (GET https://example_1.com/?subid=poison_apple) from (GET http://example_0.com/?id=0)
[scrapy] DEBUG: Redirecting (301) to (GET https://example_1/ture_a.html) from (GET https://example_1.com/?subid=poison_apple)
[scrapy] DEBUG: Crawled (200) (GET https://example_1/ture_a.html) (referer: None)

Now, how can I tell which 'http://example_0.com/?id=***' in 'start_urls' is paired with the URL 'https://example_1/ture_a.html'? Can anyone help me?

Answer 1

Every response has a request attached to it, so you can retrieve the original URL from it:

def parse(self, response):
    print('original url:')
    print(response.request.url)
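
One caveat worth checking: after a redirect, response.request is, as far as I can tell, the last request in the redirect chain rather than the first, so response.request.url may print the redirected URL instead of the one from start_urls. Scrapy's RedirectMiddleware (enabled by default) records the URLs a request passed through in the 'redirect_urls' key of the request meta. Below is a minimal sketch that reads that key and falls back to the request URL when no redirect occurred. The helper name first_request_url is hypothetical, not part of Scrapy.

```python
def first_request_url(response):
    """Return the URL that was originally requested for this response.

    Assumes Scrapy's RedirectMiddleware is enabled; it stores every URL
    visited before the final one in request.meta['redirect_urls'].
    """
    redirect_urls = response.request.meta.get('redirect_urls')
    if redirect_urls:
        # The first entry is the URL requested before any redirect.
        return redirect_urls[0]
    # No redirect was recorded, so the request URL is the original one.
    return response.request.url
```

Inside the spider's parse method, calling first_request_url(response) next to response.url should give the start URL and the final URL as a pair.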

How to get the first request URL after a 302 redirect followed by a 301
