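Extracting the full URL that contains a matching substring

Time: 2016-04-13 16:34:57

Tags: python regex scrapy

I am building a Scrapy application and need to extract the full URL whenever a substring of that URL matches.

For example:

Suppose a page contains the following URLs that I am interested in: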

  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.brpreiss.com/books/opus7/html/book.html
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://www.diveintopython.net/
  • /public/flag?cat=Computers/Programming/Languages/Python/Books&url=http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/
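  • [18 more]

But my search string is flag?cat=Computers/Programming/Languages/Python/Books

The search returns only the matching part of each URL, not the full URL. How can I get the full URLs listed above?

Here is a simple Scrapy test case based on the example: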

from scrapy.spiders import Spider
from scrapy.selector import Selector
import scrapy

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # scrapy.shell.inspect_response(response, self)
        # .re() returns only the text captured by the group, not the whole URL.
        results = response.xpath('//body').re(r'(flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks)')
        print(results)

Output:

[
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks',
    u'flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks'
]

Expected output:

[
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.brpreiss.com%2Fbooks%2Fopus7%2Fhtml%2Fbook.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.diveintopython.net%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Frhodesmill.org%2Fbrandon%2F2011%2Ffoundations-of-python-network-programming%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.techbooksforfree.com%2Fperlpython.shtml"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.freetechbooks.com%2Fpython-f6.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fgreenteapress.com%2Fthinkpython%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.network-theory.co.uk%2Fpython%2Fintro%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.freenetpages.co.uk%2Fhp%2Falan.gauld%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.wiley.com%2FWileyCDA%2FWileyTitle%2FproductCd-0471219754.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fhetland.org%2Fwriting%2Fpractical-python%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fsysadminpy.com%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.qtrac.eu%2Fpy3book.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.wiley.com%2FWileyCDA%2FWileyTitle%2FproductCd-0764548077.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=https%3A%2F%2Fwww.packtpub.com%2Fpython-3-object-oriented-programming%2Fbook"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.network-theory.co.uk%2Fpython%2Flanguage%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130409561%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0201616165%26redir%3D1"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0201748843%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0672317354"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fgnosis.cx%2FTPiP%2F"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.informit.com%2Fstore%2Fproduct.aspx%3Fisbn%3D0130211192"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing'
]

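1 answer:

Answer 0: (score: 1)

The problem is that .re() returns only the part of the text that matches the expression. If you want to keep using a regular-expression check, select the href attributes and filter them with the re:test() function (an EXSLT regular-expression extension that Scrapy selectors support) instead: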

response.xpath('//body//a/@href[re:test(., "flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks")]').extract()

This produces the following on my end:

[
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252Ben-USS_01DBC.html', 
    u'/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.brpreiss.com%2Fbooks%2Fopus7%2Fhtml%2Fbook.html',
    ...
]
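
To see the difference without running a crawl, here is a minimal standard-library sketch (Python 3) of the same idea: match the pattern against each whole href value and keep the value, rather than keeping only the matched text. It is not how Scrapy implements re:test(); SAMPLE_HTML, HrefCollector and matching_hrefs are hypothetical names made up for this illustration, and the markup is a shortened stand-in for the page in the question.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup, modelled on the links quoted in the question.
SAMPLE_HTML = """
<body>
  <a href="/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.diveintopython.net%2F">flag</a>
  <a href="/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&amp;url=http%3A%2F%2Fwww.brpreiss.com%2Fbooks%2Fopus7%2Fhtml%2Fbook.html">flag</a>
  <a href="/Computers/Programming/Languages/Python/">unrelated</a>
</body>
"""

# Same search expression as in the question, written as a raw string.
PATTERN = re.compile(r"flag\?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks")


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def matching_hrefs(html, pattern=PATTERN):
    """Return every complete href whose value contains a match for pattern."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [href for href in collector.hrefs if pattern.search(href)]


# Keeping only the matched text, which is the situation in the question.
matched_parts = PATTERN.findall(SAMPLE_HTML)

# Using the pattern as a yes/no test and keeping the whole attribute value.
full_urls = matching_hrefs(SAMPLE_HTML)

print(matched_parts)
print(full_urls)
```

Here matched_parts holds two copies of the bare search string, while full_urls holds the two complete /public/flag?... links. The HTML parser also decodes &amp; to &, which is why the extracted values contain &url= rather than &amp;url=.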