Question

我正在创建一个给定网站URL的应用程序，将下载它，包括其依赖的CSS / JS / Images资源。它还将创建Web的脱机工作版本。我非常依赖Beautiful Soup来解析html响应文本，迭代它的所有外部资源（css / js / images / etc）并在本地下载。它有时很漂亮。有时候不要

我的研究案例是this website。我在离线 - 填充 - 工作版本的网络中遇到了这个问题： glitches

首先显示页脚。当然，很多因素都会导致这种麻烦。但是，从Chrome检查员的判断来看，不要抱怨任何缺失的资源，我不确定问题是什么。

虽然在我使用此代码检查Soup编写的HTML之后：

with open(os.path.join(save_to_dir, name + '_(Offline)' + '.html'), 'w') as fd:
            fd.write(soup.encode('utf-8'))

..，我意识到HTML被过度修改了。例如，下面是网络的实时版本：

<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
<title>HALL | Group Chat, Instant Messaging</title>
<meta name="description"
      content="Hall is group chat and IM for companies and teams. Available free for the web, desktop and mobile. FREE anytime, anywhere.">
<meta property="og:title" content="Hall"/>
<meta property="og:description" content="Real-time chat &amp; texting for business teams."/>
<meta property="og:image" content="https://d3bkj0l4dzdp7x.cloudfront.net/static-assets/hall_logo_400x200.jpg"/>
<meta http-equiv="X-UA-Compatible" content="chrome=1">

虽然Soup版本如下所示：

<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title>HALL | Group Chat, Instant Messaging</title>
<meta content="Hall is group chat and IM for companies and teams. Available free for the web, desktop and mobile. FREE anytime, anywhere." name="description">
<meta content="Hall" property="og:title"/>
<meta content="Real-time chat &amp; texting for business teams." property="og:description"/>
<meta content="https://d3bkj0l4dzdp7x.cloudfront.net/static-assets/hall_logo_400x200.jpg" property="og:image"/>
<meta content="chrome=1" http-equiv="X-UA-Compatible">

以上只是一个例子。但是，我的问题是，如何实现 - 使用Soup--一个非常类似于其实时/输入版本的HTML结果？

Answer 1

beautifulsoup中标签的属性由字典保存。由于字典不使用索引，因此可以对属性进行混洗。换句话说，用bsoup做这件事是不可能的。

解决方案是：

获取网站。
将其保存在文件中。
使用beautifulsoup获取资源和资料。

有了这个，您将获得一份完整的答案副本作为您的文件，并且仍然可以使用bsoup而不会有任何减速。

Answer 2

比文森特的答案更详细一点

要获取该网站，如果您可以访问Linux，则可以使用

wget --random-wait --no-check-certificate -r -p -e robots=off -U mozilla https://hall.com/

到＆＃34;下载＆＃34;整个网站。

使用BeautifulSoup处理本地文件

您甚至可以编写一个简单的shell脚本并每天运行它以保存网站的快照。

提醒一下，请不要经常破坏网站！

如何使用Python Beautiful Soup获得完美的网站快照？

2 个答案: