Question

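How can I compress (minify) HTML from Python? I know I could strip whitespace and other things with a few regular expressions, but I want a real minifier written in pure Python (so that it can be used on Google App Engine).

I tested an online HTML compressor and it reduced the HTML size by 65%. I want the same result, but from Python.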

Answer 1

You can use htmlmin to minify HTML:

import htmlmin

html = """
<!DOCTYPE html>
<html lang="en">
<head>
  <title>Bootstrap Case</title>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
</head>
<body> 
<div class="container">
  <h2>Well</h2>
  <div class="well">Basic Well</div>
</div>
</body>
</html>
"""

# On Python 3, html is already a str. On Python 2, pass html.decode("utf-8") instead.
minified = htmlmin.minify(html, remove_empty_space=True)
print(minified)

Answer 2

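I don't think there is any real need to minify your HTML on GAE, because GAE already gzips it; see Caching & GZip on GAE (Community Wiki).

I haven't tested it, but the minified version of the HTML would probably save only about 1% in size, because once both versions are gzipped, the only difference is the whitespace that was removed.

If you want to save storage, for example when storing pages in memcached, you are better off gzipping the HTML (even at a low compression level) than stripping whitespace. In Python the result will probably be both smaller and faster to produce, because the compression runs in C rather than in pure Python.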
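The size claim above is easy to check against your own pages. The sketch below uses only the standard-library zlib module. The sample page and the helper name compare_sizes are illustrative assumptions, not part of the original answer, and the whitespace collapse is a crude stand-in for a real minifier.

```python
import re
import zlib

def compare_sizes(html, level=1):
    """Return byte sizes of a page: raw, whitespace-collapsed, and zlib-compressed."""
    raw = html.encode("utf-8")
    # Crude stand-in for a minifier: collapse every whitespace run to one space.
    stripped = re.sub(r"\s+", " ", html).strip().encode("utf-8")
    return {
        "raw": len(raw),
        "stripped": len(stripped),
        "raw_compressed": len(zlib.compress(raw, level)),
        "stripped_compressed": len(zlib.compress(stripped, level)),
    }

# Hypothetical sample page, repeated so the compressor has something to work with.
sample = '<div class="row">\n    <p>  Hello   World  </p>\n</div>\n' * 200
sizes = compare_sizes(sample)
for name in sorted(sizes):
    print(name, sizes[name])
```

Run it on real templates rather than on the synthetic sample before drawing conclusions; highly repetitive input flatters the compressor.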

Answer 3

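htmlmin and html_slimmer are a couple of simple HTML minification tools for Python. I have millions of HTML pages stored in my database, and by running htmlmin on them I was able to reduce page size by 5 to 50%. Neither tool does an optimal job of full HTML minification (for example, the font color #000000 could be reduced to #000), but they are a good start. I have a try/except block that runs htmlmin and, if that fails, falls back to html_slimmer, because htmlmin seems to give better compression but does not support non-ASCII characters.

Example code: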

import htmlmin
from slimmer import html_slimmer # or xhtml_slimmer, css_slimmer
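try:
    # Prefer htmlmin, which tends to compress better.
    html = htmlmin.minify(html, remove_comments=True, remove_empty_space=True)
except Exception:
    # Fall back to html_slimmer if htmlmin fails (e.g. on non-ASCII input).
    html = html_slimmer(html.strip().replace('\n', ' ').replace('\t', ' ').replace('\r', ' '))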

Good luck!
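The answer above notes that neither tool shortens colors such as #000000 to #000. A minimal sketch of that one optimization follows; shorten_hex_colors is a hypothetical helper, not something htmlmin or slimmer provides. It assumes you apply it only to CSS or style attribute values, because a plain regular expression cannot tell a color from a URL fragment such as href="#ffeedd".

```python
import re

# Six hex digits made of three doubled pairs (e.g. #aabbcc), not followed by more hex digits.
HEX6 = re.compile(r"#([0-9a-fA-F])\1([0-9a-fA-F])\2([0-9a-fA-F])\3\b")

def shorten_hex_colors(css_text):
    """Rewrite #rrggbb colors as #rgb when each channel is a doubled digit."""
    return HEX6.sub(r"#\1\2\3", css_text)

print(shorten_hex_colors("color:#000000; background:#a1b2c3;"))
```

Colors whose channels are not doubled digits, such as #a1b2c3, are left untouched.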

Answer 4

import htmlmin

code='''<body>
    Hello World
    <div style='color:red;'>Hi</div>
    </body>
'''

htmlmin.minify(code)

The last line outputs:

<body> Hello World <div style=color:red;>Hi</div> </body>

You can use this code to remove the empty whitespace as well:

htmlmin.minify(code,remove_empty_space=True)

Answer 5

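I wrote a build script that copies my templates into another directory, and I use this trick to tell my application to pick the right templates in development mode or in production: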

DEV = os.environ['SERVER_SOFTWARE'].startswith('Development') and not PRODUCTION_MODE

TEMPLATE_DIR = 'templates/2012/head/' if DEV else 'templates/2012/output/'
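Note that os.environ['SERVER_SOFTWARE'] raises a KeyError when the variable is not set, for example when the code runs outside the App Engine runtime. A slightly more defensive sketch of the same trick is shown below; pick_template_dir is a hypothetical helper name, and PRODUCTION_MODE is assumed to be a flag you define yourself.

```python
import os

PRODUCTION_MODE = False  # assumed to be a flag defined elsewhere in your app

def pick_template_dir(environ=os.environ, production_mode=PRODUCTION_MODE):
    """Return the unminified template dir in development, the minified one otherwise."""
    # .get() avoids a KeyError when SERVER_SOFTWARE is not set at all.
    software = environ.get('SERVER_SOFTWARE', '')
    dev = software.startswith('Development') and not production_mode
    return 'templates/2012/head/' if dev else 'templates/2012/output/'

TEMPLATE_DIR = pick_template_dir()
```

Passing the environment as a parameter also makes the choice easy to test without touching the real process environment.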

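Whether or not your web server gzips the output is not really that important; you should save every byte you can for performance reasons.

If you look at some of the biggest websites, they often do things like writing invalid HTML to save bytes. For example, omitting the double quotes around the id attribute of an HTML tag is common (strictly speaking, HTML5 permits unquoted attribute values, although XHTML does not), e.g.:

<div id=mydiv> ... </div>

instead of:

<div id="mydiv"> ... </div>

There are several more examples like this, but I think they are outside the scope of this thread.

Back to the question: I put together a build script that minifies your HTML, CSS and JS. Caveat: it does not handle the case of PRE tags.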

import os
import re
import sys

from subprocess import call

HEAD_DIR = 'templates/2012/head/'

OUT_DIR = 'templates/2012/output/'

REMOVE_WS = re.compile(r"\s{2,}").sub

YUI_COMPRESSOR = 'java -jar tools/yuicompressor-2.4.7.jar '

CLOSURE_COMPILER = 'java -jar tools/compiler.jar  --compilation_level ADVANCED_OPTIMIZATIONS '

def ensure_dir(f):
    d = os.path.dirname(f)
    if not os.path.exists(d):
        os.makedirs(d)

def getTarget(fn):
  return fn.replace(HEAD_DIR, OUT_DIR)

def processHtml(fn, tg):
  # Collapse every run of two or more whitespace characters into one space.
  with open(fn, 'r') as f:
    content = f.read()
  content = REMOVE_WS(" ", content)
  ensure_dir(tg)
  with open(tg, 'w') as d:
    d.write(content)
  return content

def processCSS(fn, tg):
  cmd = YUI_COMPRESSOR + fn + ' -o ' + tg
  call(cmd, shell=True)
  return

def processJS(fn, tg):
  cmd = CLOSURE_COMPILER + fn + ' --js_output_file ' + tg
  call(cmd, shell=True)
  return

# Script starts here.
ensure_dir(OUT_DIR)
for root, dirs, files in os.walk(os.getcwd()):
  for dir in dirs:
    print("Processing " + os.path.join(root, dir))
  for file in files:
    fn = os.path.join(root, file)
    # Skip files that already live in the output directory.
    if OUT_DIR in fn:
      continue
    tg = getTarget(fn)
    if file.endswith('.html'):
      processHtml(fn, tg)
    if file.endswith('.css'):
      processCSS(fn, tg)
    if file.endswith('.js'):
      processJS(fn, tg)
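As warned above, the build script collapses whitespace inside PRE tags too, which changes how preformatted text renders. One possible way around this is sketched below; collapse_ws_keep_pre is a hypothetical replacement for the REMOVE_WS call in processHtml, not part of the original script. It is regex-based rather than a real HTML parser, so it assumes PRE elements are not nested and it does not protect textarea or script content.

```python
import re

# Split on <pre>...</pre> blocks; the capturing group keeps them in the result.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)
WS_RUN = re.compile(r"\s{2,}")

def collapse_ws_keep_pre(content):
    """Collapse whitespace runs everywhere except inside <pre> elements."""
    parts = PRE_BLOCK.split(content)
    # With one capturing group, even indexes are ordinary markup and
    # odd indexes are the <pre> blocks, which are left untouched.
    for i in range(0, len(parts), 2):
        parts[i] = WS_RUN.sub(" ", parts[i])
    return "".join(parts)

page = "<p>a   b</p>\n\n<pre>x\n    y</pre>\n\n<p>c</p>"
print(collapse_ws_keep_pre(page))
```

To use it in the script, call collapse_ws_keep_pre(content) where processHtml currently calls REMOVE_WS(" ", content).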
