Question

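I've just started working with HTML pages in Python, and I'm trying to build a very simple web crawler. I've managed to download all the links on the site I'm working with, but to go fully offline I need to replace every URL on the site with a local address. For example, I saved the page "www.domain.com/news" under the path "myfile/sub0/0". How can I use Python to replace the URLs in each HTML page I downloaded with these local addresses? I already get the list of links with this regular expression: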

urls = re.findall('href=[\'"]?(http://[^\'" >]+)', htmlSource)

Answer 1

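Read each HTML file from the directory and run a regular-expression substitution like the one below to change the URLs to your local ones. Here is an example that rewrites href links.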

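import re

response = """
<a class="abc" href="http://www.example.com/abc.py">link a</a>
<a class="xyz" href="/xyz.py">link x</a>
"""
# Group 1 keeps everything up to the opening quote of href; the optional
# scheme+host and the leading slash are dropped in favour of the local prefix.
response = re.sub(r"(<a [^>]*href\s*=\s*['\"])(https?://www\.example\.com)?/?", r"\1myfile/sub0/0/", response)
print(response)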

Output:

<a class="abc" href="myfile/sub0/0/abc.py">link a</a>
<a class="xyz" href="myfile/sub0/0/xyz.py">link x</a>

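Note that you may need to adjust the regular expression to suit your needs. As written, it prefixes every href, including relative links and links to other domains.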
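If each downloaded page was saved under its own path, a single fixed prefix is not enough. The following is a minimal sketch of one way to handle that, not part of the original answer: it assumes you kept a dictionary from each downloaded URL to the path it was saved under. The names `URL_TO_LOCAL`, `rewrite_links`, and `rewrite_directory` are hypothetical, and it assumes the saved pages end in `.html`.

```python
import re
from pathlib import Path

# Hypothetical mapping: downloaded URL -> local path it was saved under.
# Keys are assumed to be stored without a trailing slash.
URL_TO_LOCAL = {
    "http://www.domain.com/news": "myfile/sub0/0",
}

# Like the pattern in the question, but it also captures the text up to the
# URL (so it can be written back unchanged) and accepts https.
HREF_RE = re.compile(r"""(href\s*=\s*['"]?)(https?://[^'" >]+)""")


def rewrite_links(html, url_to_local):
    """Replace each absolute href that has a local copy; leave the rest alone."""
    def swap(match):
        prefix, url = match.group(1), match.group(2)
        local = url_to_local.get(url.rstrip("/"))
        if local is None:
            return match.group(0)  # no local copy: keep the original link
        return prefix + local
    return HREF_RE.sub(swap, html)


def rewrite_directory(root, url_to_local):
    """Rewrite every *.html file below root in place (assumed extension)."""
    for path in Path(root).rglob("*.html"):
        html = path.read_text(encoding="utf-8", errors="replace")
        path.write_text(rewrite_links(html, url_to_local), encoding="utf-8")
```

Calling `rewrite_directory("myfile", URL_TO_LOCAL)` would then update the saved pages. The local paths are inserted verbatim, so a browser resolves them relative to the page that contains the link; if the pages sit at different directory depths, you would probably want to compute each target with `os.path.relpath` first.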

Replacing all links in an HTML page with local addresses using Python
