Question

I am trying to get the final redirected URL in Scrapy. For example, if an anchor tag has a specific format:

<a href="http://www.example.com/index.php" class="FOO_X_Y_Z" />

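then I need the URL that this link redirects to (if it redirects at all; if it simply returns 200, that is fine). For example, I get the matching anchor tags like this: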

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        final_url = get_final_url(anchor)  # << I would need something like this

        # Save final_url

So if I visit http://www.example.com/index.php and it sends me through 10 redirects before finally stopping at http://www.example.com/final.php, that last URL is what I need get_final_url() to return.

I have thought of ways to work around this, but I am asking here whether Scrapy already provides a solution.

Answer 1

Again assuming anchor contains the actual URL, I did it with urllib2:

import urllib2

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        # urlopen(url, data, timeout): follows redirects, 1-second timeout
        final_url = urllib2.urlopen(anchor, None, 1).geturl()

        # Save final_url

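urllib2.urlopen() returns a file-like object with a few additional methods, one of which is geturl(), which returns the final URL (after all redirects have been followed). It is not part of Scrapy, but it works.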
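For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is a minimal sketch, not the answerer's code: the function name get_final_url is taken from the question, and the timeout default is an assumption.

```python
import urllib.request


def get_final_url(url, timeout=5.0):
    """Follow redirects for `url` and return the URL of the final response."""
    # urlopen follows HTTP redirects automatically;
    # geturl() reports the URL of the response that was finally retrieved.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Note that this call blocks until the remote server answers, so inside a Scrapy callback it would likely stall other requests while it waits.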

Answer 2

I use response.headers, which returns the response's header information. The new URL is the value of the 'Location' key.

In [1]: response.headers
Out[1]: 
{'Date': 'Thu, 09 Jun 2016 00:18:18 GMT',
 'Location': 'https:/www.protiviti.com/en-US/Pages/default.aspx',
 'Server': 'nginx/1.9.1',
 'X-Ms-Invokeapp': '1; RequireReadOnly'}
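A Location header is only present on a redirect response, and its value may be relative, so it usually has to be resolved against the URL that was requested. The helper below is a rough sketch of that step using only the standard library; the name redirect_target and the plain dict of string headers are assumptions for illustration (Scrapy's own header values are bytes and would need decoding first).

```python
from urllib.parse import urljoin

# Status codes that signal a redirect carrying a Location header.
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirect_target(request_url, status, headers):
    """Return the absolute redirect target, or None if this is not a redirect."""
    if status not in REDIRECT_STATUSES:
        return None
    location = headers.get("Location")
    if location is None:
        return None
    # Resolve relative values such as "/final.php" against the requested URL.
    return urljoin(request_url, location)
```

This gives only the next hop. To reach the final URL through a chain of redirects, the lookup would have to be repeated until a non-redirect response comes back.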

Answer 3

It's simple:

print(response.url)  # inside parse()
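This works when Scrapy itself requests the link, because its redirect middleware follows redirects by default before the callback runs, so response.url is the URL where the request ended up. A minimal callback sketch follows; the name parse_final is hypothetical, and it assumes the link was scheduled with something like scrapy.Request(anchor, callback=parse_final). The redirect_urls meta key is filled in by Scrapy's redirect middleware when at least one redirect happened.

```python
def parse_final(response):
    """Scrapy callback: report where the request finally landed."""
    yield {
        # URL of the response after all redirects were followed.
        "final_url": response.url,
        # URLs the request passed through on the way; empty if no redirect occurred.
        "redirect_chain": response.meta.get("redirect_urls", []),
        "status": response.status,
    }
```

Unlike a blocking urlopen call, this keeps the extra request inside Scrapy's scheduler, at the cost of handling the result in a separate callback.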

scrapy - Get the final redirected URL

3 answers: