Question

我正在编写一个python程序，需要能够在另一个站点上镜像内容。

下载html后，我需要用完整的链接替换所有相关链接（例如<img src='/foo.png'>）（例如<img src='http://thesitewherethepageisfrom.com/foo.png'>）。

我还需要替换所有相关文件路径。例如，如果我下载了http://example.com/bar/foo.php并且<img src='foobar.jpg'>，我实际上需要将其替换为<img src='http://example.com/bar/foobar.jpg'>而不是<img src='http://example.com/foobar.jpg'>。

我目前正在使用正则表达式：

((?<=src=[\"'])|(?<=href=.))(?!(http(s|)(:|%3[Aa])))[0-9A-Za-z%?&#_=+.~]([0-9A-Za-z%?&#_=+./~])*(?=['\"])

和

((?<=src=[\"'])|(?<=href=.))(?!(http(s|)(:|%3[Aa])))([0-9A-Za-z%?&#_=+./~])*(?=['\"])

表示不是完整链接的相对和完整文件路径。 python是否提供了将文本添加到每个正则表达式数学的方法？我需要能够遍历匹配并将http://example.com或http://example.com/bar/添加到每个匹配项中。

Python前置为正则表达式匹配

0 个答案: