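I have a Scrapy spider set up and I'm trying to process links. The problem is that the links are embedded in JavaScript, and I'm struggling to come up with a regular expression. Here are 3 samples of what I'm trying to process: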
javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')
javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');
javascript:openInIFrame('main', "page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7")
The resulting relative URL for each is the text between the single/double quotes:
setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118
overview.phtml?&.who=AAAAAAAAAAAA&.id=2
page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7
I've tried various variations of '(.*?)' and (["'])(?:(?=(\\?))\2.)*?\1 but can't seem to get it right. What am I missing here?
Answer 0: (score: 0)
Answer 1: (score: 0)
Try this:
import re

# Match either call prefix from the samples, then capture the first quoted argument after it.
url_regex = re.compile(r"(?:javascript:openInIFrame\('main',|javascript:window\.open\()\s*(?:'|\")([^'\"]+)(?:'|\")")

samples = [
    "javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')",
    "javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');",
    "javascript:openInIFrame('main', \"page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7\")",
]

for sample in samples:
    md = url_regex.search(sample)
    if md:
        print(md.group(1))  # group 1 is the relative URL
    else:
        print('NO MATCH')
For me, this outputs:
setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118
overview.phtml?&.who=AAAAAAAAAAAA&.id=2
page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7
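The trick is ([^'\"]+). This captures any sequence of one or more characters, as long as none of them is a double quote or a single quote. So basically it captures everything up to the end of the URL string, which is exactly the URL. Note that the \" is only needed because the regex itself is delimited by "; inside the pattern it still just matches a literal double quote.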
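Since the question is about Scrapy, here is a rough sketch of how the same idea could be packaged as a function. The names `JS_URL_RE` and `extract_js_url` and the `decode` flag are my own, not from the question, and the two call prefixes are hardcoded from the three samples. Instead of a negated character class, this variant uses a backreference so that the closing quote has to be the same kind as the opening one.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: the two call prefixes are taken from the samples only.
# Group 1 is the opening quote, group 2 the relative URL, and \1 requires the
# same quote character to close the string.
JS_URL_RE = re.compile(
    r"""javascript:(?:openInIFrame\(\s*'main'\s*,|window\.open\()\s*(["'])(.*?)\1"""
)


def extract_js_url(value, decode=False):
    """Return the relative URL embedded in a javascript: link.

    Values that do not look like one of the sample calls are returned
    unchanged, so ordinary hrefs pass through untouched.
    """
    match = JS_URL_RE.search(value)
    if match is None:
        return value
    url = match.group(2)
    # Optionally turn %3f, %3d, %26 back into ?, = and &.
    return unquote(url) if decode else url


if __name__ == "__main__":
    sample = "javascript:openInIFrame('main', \"page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7\")"
    print(extract_js_url(sample))
    print(extract_js_url(sample, decode=True))
```

If you are using a LinkExtractor, a function like this can probably be passed as its process_value argument, e.g. LinkExtractor(process_value=extract_js_url), so each href is rewritten before the link is followed. Whether you want decode=True depends on how the site expects those percent-encoded URLs to be requested, so treat that part as an assumption to check.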