Retrieving a string with a regex in Python 2.7.2

Time: 2013-07-04 20:30:15

Tags: python regex

I have the following snippet from a page's source:

var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); 

'PDFObject(' is unique on the page. I want to use a regex to retrieve the URL. In this case, I need to get

http://www.site.com/doc55.pdf

Please help.

7 answers:

Answer 0: (score: 3)

Here is an alternative way to solve the problem without using a regular expression:

url, in_object = None, False
with open('input') as f:
    for line in f:
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            url = line.split('"')[1]
            break
print url
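
The same line-scanning idea can be wrapped in a function that works on a string already held in memory rather than a file. This is only a minimal sketch: the function name `find_pdf_url` and the shortened `page_source` sample are made up for illustration and are not part of the original answer.

```python
def find_pdf_url(source):
    """Return the first quoted value on a 'url:' line after 'PDFObject('."""
    in_object = False
    for line in source.splitlines():
        # Start paying attention once the unique marker has been seen.
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            parts = line.split('"')
            # A quoted value yields at least three pieces: before, value, after.
            if len(parts) >= 3:
                return parts[1]
    return None  # marker or url line not found


# Hypothetical sample, shortened from the snippet in the question.
page_source = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

url = find_pdf_url(page_source)
```

Returning None instead of raising lets the caller decide what to do when the page layout changes.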

Answer 1: (score: 0)

Use a combination of look-behind and look-ahead assertions:

import re
re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'
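
To see what the assertions contribute, the sketch below shows the raw match before stripping and compares it with a plain capture group. The sample string `s` is a shortened, assumed stand-in for the page source, and the capture-group variant is offered as an alternative, not as what this answer proposed.

```python
import re

# Assumed sample: the first two lines of the object literal from the question.
s = 'url: "http://www.site.com/doc55.pdf",\n  id: "pdfObjectContainer",'

# Look-behind and look-ahead are zero-width: they constrain where the match
# may sit, but 'url:' and '",' are not part of the matched text.
raw = re.search(r'(?<=url:).*?(?=",)', s).group()
url_lookaround = raw.strip('" ')

# Alternative: consume the delimiters and capture only the part in between.
url_group = re.search(r'url:\s*"([^"]+)"', s).group(1)
```

Here `raw` still carries the leading space and opening quote, which is why the original one-liner needs the trailing `strip('" ')`; the capture-group form avoids that clean-up step.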

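Answer 2: (score: 0)

While the other answers may appear to work, most of them do not take into account that the only unique thing on the page is 'PDFObject('. A better regex would be the following: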

PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",

It takes into account that 'PDFObject(' is unique and includes some basic URL validation.

Here is an example of how to use this regex in Python:

>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
...   id: "pdfObjectContainer",
...   width: "100%",
...   height: "700px",
...   pdfOpenParams: {
...     navpanes: 0,
...     statusbar: 1,
...     toolbar: 1,
...     view: "FitH"
...   }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'

A pure Python (no regex) alternative is:

>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'

A no-regex one-liner:

>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'

Answer 3: (score: 0)

This works:

import re

src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''   

print [m.group(1).strip('"') for m in 
        re.finditer(r'^url:\s*(.*)[\W]$',
        re.search(r'PDFObject\(\{(.*)',src,re.M | re.S | re.I).group(1),re.M|re.I)]

Prints:

['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']

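Answer 4: (score: 0)

To be able to find "something that occurs after something else", you need to match content "including newlines". For that, you use the (dotall) modifier, a flag added at compile time.

So the following code works: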

import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''

print r.findall(s)

Explanation:

r = re.compile(         compile regular expression
    r'                  raw string literal, so backslashes reach the regex engine unchanged
    (?<=PDFObject)      the match I want happens right after PDFObject
    .*?                 then there may be some other characters...
    url:                followed by the string url:
    .*?                 then match as little as possible (`?` : non-greedy match) until you reach
    (http.*?)"          the string http, captured up to (but not including) the first "
    ',                  end of regex string, but there's more...
    re.DOTALL)          set the DOTALL flag - this means the dot matches all characters
                        including newlines. This allows the match to continue from one line
                        to the next in the .*? right after the lookbehind
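
The effect of the flag can be checked directly by running the same pattern with and without it. This is a small sketch on an assumed two-line sample (`s` is shortened from the snippet above); the variable names are illustrative.

```python
import re

# Assumed sample: 'PDFObject' and 'url:' sit on different lines.
s = 'var myPDF = new PDFObject({\nurl: "http://www.site.com/doc55.pdf",\n'
pattern = r'(?<=PDFObject).*?url:.*?(http.*?)"'

# Without DOTALL the dot stops at the newline, so 'url:' is never reached.
without_dotall = re.search(pattern, s)

# With DOTALL the dot also matches '\n' and the match spans both lines.
with_dotall = re.search(pattern, s, re.DOTALL)
url = with_dotall.group(1)
```

In this sample `without_dotall` is None, because the look-behind pins the start of the match to the text right after PDFObject and the first `.*?` cannot cross the line break.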

Answer 5: (score: 0)

Regex

new\s+PDFObject\(\{\s*url:\s*"[^"]+"
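
As written, the pattern matches everything from `new` through the closing quote. One way to get only the URL, sketched here as an assumption rather than as this answer's own code, is to wrap the quoted part in a capture group. The sample `src` is shortened from the question.

```python
import re

# Assumed sample, shortened from the snippet in the question.
src = ('var myPDF = new PDFObject({\n'
       'url: "http://www.site.com/doc55.pdf",\n'
       '  id: "pdfObjectContainer",')

# The pattern from the answer: the whole prefix is part of the match.
whole = re.search(r'new\s+PDFObject\(\{\s*url:\s*"[^"]+"', src).group()

# Same pattern with a capture group around the quoted value.
only_url = re.search(r'new\s+PDFObject\(\{\s*url:\s*"([^"]+)"', src).group(1)
```

Because `\s` already matches newlines, this pattern needs no DOTALL flag.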

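Answer 6: (score: 0)

If 'PDFObject(' is the unique identifier in the page, you only need to match the next quoted content.

Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write: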

import re

snippet = '''                                    
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder");
'''

# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)

# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)

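# The flag is already set at compile time; the second argument of a compiled
# pattern's search() is a start position, not a flag.
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')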
# result for both: 'http://www.site.com/doc55.pdf'

If you do not want to compile the regex because it is only used once, simply use the following syntax:

re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')

Four options; one of them should suit your needs and taste!