I have the following snippet from a page's source:
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
The string 'PDFObject(' is unique on the page. I want to use a regex to retrieve the URL. In this case, I need to get
http://www.site.com/doc55.pdf
Please help.
Answer 0 (score: 3)
Here is an alternative way to solve the problem without using regular expressions:
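url, in_object = None, False
with open('input') as f:
    for line in f:
        # Start looking once the unique 'PDFObject(' marker has been seen
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            # The URL is the first double-quoted string on the line
            url = line.split('"')[1]
            break
print(url)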
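If the page source is already in memory as a string rather than in a file, the same line scan can be wrapped in a function. This is only a sketch: the name `find_pdf_url` and the shortened sample page are made up for illustration, and it returns None instead of raising when no quoted value is found.

```python
def find_pdf_url(source):
    """Return the first url: value found after 'PDFObject(', or None."""
    in_object = False
    for line in source.splitlines():
        # Start paying attention once the unique marker has been seen.
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            parts = line.split('"')
            if len(parts) >= 3:  # need an opening and a closing quote
                return parts[1]
    return None


# Shortened sample page (hypothetical, based on the question's snippet).
page = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(find_pdf_url(page))
```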
Answer 1 (score: 0)
Use a combination of look-behind and look-ahead assertions:
import re
re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'
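For completeness, here is a self-contained version of the same idea, assuming `s` holds the page source (the shortened sample below is made up from the question's snippet). Note that this pattern is not anchored on 'PDFObject(' at all; it simply takes the first `url:` in the text, so it is only safe when no other `url:` appears earlier on the page.

```python
import re

# Shortened sample page (hypothetical, based on the question's snippet).
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

# (?<=url:)  look-behind: the match must start right after "url:"
# .*?        as few characters as possible (no newlines without re.S)
# (?=",)     look-ahead: stop just before the closing quote and comma
match = re.search(r'(?<=url:).*?(?=",)', s)

# The raw match still carries the leading space and opening quote.
url = match.group().strip('" ') if match else None
print(url)
```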
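Answer 2 (score: 0)
While the other answers may appear to work, most of them ignore the fact that the only unique thing on the page is 'PDFObject('. A better regex is the following: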
PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",
It takes into account that 'PDFObject(' is unique and includes some basic URL validation.
Below is an example of how to use this regex in Python:
>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
... id: "pdfObjectContainer",
... width: "100%",
... height: "700px",
... pdfOpenParams: {
... navpanes: 0,
... statusbar: 1,
... toolbar: 1,
... view: "FitH"
... }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'
A pure Python (no regex) alternative is:
>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'
A no-regex one-liner:
>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'
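One caveat with the find-based versions: `str.find` returns -1 when the marker is missing, so the slice silently yields garbage instead of failing. A possible variant, sketched here with `str.partition` on a shortened sample string, makes the "not found" case explicit:

```python
# Shortened sample page (hypothetical, based on the question's snippet).
strs = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

unique = 'PDFObject({\nurl: "'

# partition returns (before, separator, after);
# the separator is an empty string when the marker is not found.
_, found, rest = strs.partition(unique)

# Everything up to the next double quote is the URL.
url = rest.partition('"')[0] if found else None
print(url)
```

Like the original, this assumes that `url:` starts at the very beginning of the line right after `PDFObject({`.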
Answer 3 (score: 0)
This works:
import re
src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print([m.group(1).strip('"') for m in
       re.finditer(r'^url:\s*(.*)[\W]$',
                   re.search(r'PDFObject\(\{(.*)', src, re.M | re.S | re.I).group(1),
                   re.M | re.I)])
Prints:
['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']
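Answer 4 (score: 0)
To be able to find "something that occurs after something else", you need to match content "including newlines". For that you use the (dotall) modifier, a flag added when the regex is compiled.
So the following code works: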
import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder"); '''
print(r.findall(s))
Explanation:
r = re.compile(      compile the regular expression
r'                   start of a raw string, so backslashes reach the regex engine unchanged
(?<=PDFObject)       the match I want happens right after PDFObject
.*?                  then there may be some other characters...
url:                 followed by the string url:
.*?                  then match whatever follows, as little as possible (`?` makes the match non-greedy), until
(http.*?)"           the string starting with http, captured up to (but not including) the first "
',                   end of the regex string, but there's more...
re.DOTALL)           set the DOTALL flag - this means the dot matches all characters
                     including newlines. This allows the match to continue from one line
                     to the next in the .*? right after the lookbehind
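The same breakdown can also live inside the pattern itself by adding the re.VERBOSE flag, which ignores whitespace in the pattern and allows `#` comments. A minimal sketch, using a shortened copy of the sample string (the variable name `pattern` is just for illustration):

```python
import re

pattern = re.compile(r'''
    (?<=PDFObject)   # the match starts right after the text PDFObject
    .*?              # skip as little as possible (newlines included)...
    url:             # ...up to the literal text url:
    .*?              # skip as little as possible again...
    (http.*?)        # ...then capture from http up to, not including,
    "                # the first closing double quote
''', re.DOTALL | re.VERBOSE)

# Shortened sample page (hypothetical, based on the question's snippet).
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder"); '''

print(pattern.findall(s))
```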
Answer 5 (score: 0)
Answer 6 (score: 0)
If 'PDFObject(' is the unique identifier in the page, you only need to match the next quoted content.
Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write:
import re
snippet = '''
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer",
width: "100%",
height: "700px",
pdfOpenParams: {
navpanes: 0,
statusbar: 1,
toolbar: 1,
view: "FitH"
}
}).embed("pdf_placeholder");
'''
# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)
# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)
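# search() on a compiled pattern takes (string, pos, endpos); the flags were already set at compile time
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')
# result for both: 'http://www.site.com/doc55.pdf'
If you don't want to compile the regex because it is only used once, just use the following syntax: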
re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')
Four options; one of them should fit your needs and taste!
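All four variants raise AttributeError when the page contains no match, because `search` returns None in that case. If that can happen, a small wrapper may help. This is only a sketch: `extract_pdf_url` and `PDF_URL_RE` are hypothetical names, and it still assumes the URL is the first double-quoted string after 'PDFObject('.

```python
import re

# Named-group pattern; re.S lets .*? run across newlines.
PDF_URL_RE = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)


def extract_pdf_url(page_source):
    """Return the first quoted string after 'PDFObject(', or None."""
    match = PDF_URL_RE.search(page_source)
    return match.group('url') if match else None


# Shortened sample page (hypothetical, based on the question's snippet).
snippet = '''
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
id: "pdfObjectContainer"
}).embed("pdf_placeholder");
'''

print(extract_pdf_url(snippet))
print(extract_pdf_url('<html>no viewer here</html>'))
```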