Question

Is there a way to get the CSS classes out of an HTML file using BeautifulSoup? Example excerpt:

<style type="text/css">

 p.c3 {text-align: justify}

 p.c2 {text-align: left}

 p.c1 {text-align: center}

</style>

The ideal output would be:

cssdict = {
    'p.c3': {'text-align': 'justify'},
    'p.c2': {'text-align': 'left'},
    'p.c1': {'text-align': 'center'}
}

Although something like this would also do:

L = [
    ('p.c3', {'text-align': 'justify'}),
    ('p.c2', {'text-align': 'left'}),
    ('p.c1', {'text-align': 'center'})
]

Answer 1

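BeautifulSoup itself does not parse CSS style declarations at all, but you can extract those sections and then parse them with a dedicated CSS parser.

Depending on your needs, there are several CSS parsers available for Python; I'd pick cssutils (requires Python 2.5 or later, including Python 3), which is the most complete in its support and also handles inline styles.

Other options include css-py and tinycss.

To grab and parse all the style sections (example using cssutils):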

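import cssutils

# `tree` is assumed to be an already-parsed BeautifulSoup object
sheets = []
for styletag in tree.findAll('style', type='text/css'):
    if not styletag.string:  # empty tag; probably an external sheet
        continue
    # parseString handles a whole stylesheet; parseStyle only parses a
    # bare declaration block such as the value of a style attribute
    sheets.append(cssutils.parseString(styletag.string))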

With cssutils you can combine these sheets, resolve imports, and even have it fetch external stylesheets.

Answer 2

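There is tinycss, a parser written specifically for parsing CSS in Python; it even supports some CSS3. BeautifulSoup only handles HTML markup and cannot search for specific CSS classes unless you resort to regular expressions.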

http://packages.python.org/tinycss/

PS: However, it only works with Python 2.6 and above.
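For completeness, the regular-expression route mentioned above might look roughly like the sketch below. This is a deliberately naive, standard-library-only illustration, not tinycss and not a real CSS parser: it assumes flat rules with no comments, at-rules, nesting, or semicolons inside values. The names `RULE_RE` and `parse_flat_css` are hypothetical.

```python
import re

# Sample CSS taken from the question
CSS = """
 p.c3 {text-align: justify}
 p.c2 {text-align: left}
 p.c1 {text-align: center}
"""

# Naive pattern: "selector { declarations }" with no nested braces
RULE_RE = re.compile(r'([^{}]+)\{([^{}]*)\}')


def parse_flat_css(text):
    """Return {selector: {property: value}} for simple, un-nested rules only."""
    result = {}
    for selector, body in RULE_RE.findall(text):
        declarations = {}
        for chunk in body.split(';'):
            if ':' not in chunk:
                continue  # skip empty fragments between semicolons
            name, value = chunk.split(':', 1)
            declarations[name.strip()] = value.strip()
        result[selector.strip()] = declarations
    return result


cssdict = parse_flat_css(CSS)
```

Anything beyond trivial stylesheets is better handed to a dedicated parser such as tinycss or cssutils.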

Answer 3

A BeautifulSoup & cssutils combo will do the trick nicely:

    from bs4 import BeautifulSoup as BSoup
    import cssutils
    selectors = {}
    with open(htmlfile) as webpage:
        html = webpage.read()
        soup = BSoup(html, 'html.parser')
    for styles in soup.select('style'):
        css = cssutils.parseString(styles.encode_contents())
        for rule in css:
            if rule.type == rule.STYLE_RULE:
                style = rule.selectorText
                selectors[style] = {}
                for item in rule.style:
                    propertyname = item.name
                    value = item.value
                    selectors[style][propertyname] = value

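BeautifulSoup finds all the "style" tags in the HTML (head & body), .encode_contents() converts the BeautifulSoup object into bytes that cssutils can read, and cssutils then parses the individual CSS rules all the way down to the selector level via rule.selectorText and the property/value level via rule.style.

Note: "rule.STYLE_RULE" filters for style rules only. The cssutils documentation details the options for filtering media rules, comments, and imports.

It would be cleaner if you broke this out into functions, but you get the gist...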

BeautifulSoup: getting CSS classes from HTML

3 answers: