What does 'html.parser' mean in BeautifulSoup(source_code, 'html.parser')?

Time: 2019-07-20 12:49:26

Tags: python python-3.x beautifulsoup

I don't understand the BeautifulSoup syntax, in particular the purpose of the HTML parser argument inside the parentheses.

BeautifulSoup(source_code, 'html.parser')

2 answers:

Answer 0: (score: 0)

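This appears to specify which library is used to parse source_code. Check out the options in the docs and how they compare.

As far as I understand, 'html.parser' uses the parser from Python 3's built-in html module (html.parser), found here.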
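
To make that concrete, here is a minimal sketch that uses the standard-library html.parser module directly, without BeautifulSoup. The LinkCollector class, the collect_links helper, and the sample markup are made up for illustration; the point is only that 'html.parser' names a parser that ships with Python, so nothing extra has to be installed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def collect_links(markup):
    # Feed the whole document to the parser and return what it collected
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


sample = '<p>See <a href="https://example.com/a">A</a> and <a href="/b">B</a>.</p>'
links = collect_links(sample)
print(links)
```

BeautifulSoup wraps this kind of event-based parsing and builds a navigable tree from it, which is why you normally never touch HTMLParser yourself.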


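Answer 1: (score: 0)

You can check out the BeautifulSoup source code to see the constructor's parameters and how they are used. Here is the relevant part of the BeautifulSoup class constructor, from __init__.py: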

def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):
    ...
    if builder is None:
        original_features = features
        if isinstance(features, basestring):
            features = [features]
        if features is None or len(features) == 0:
            features = self.DEFAULT_BUILDER_FEATURES
        builder_class = builder_registry.lookup(*features)
        if builder_class is None:
            raise FeatureNotFound(
                "Couldn't find a tree builder with the features you "
                "requested: %s. Do you need to install a parser library?"
                % ",".join(features))
        builder = builder_class()
        if not (original_features == builder.NAME or
                original_features in builder.ALTERNATE_NAMES):
            if builder.is_xml:
                markup_type = "XML"
            else:
                markup_type = "HTML"
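
The lookup step in that excerpt can be pictured with a much simplified toy model. Everything below (ToyRegistry, pick_builder, the feature sets, and the single-item default) is hypothetical and is not BeautifulSoup's real registry, whose matching logic is more involved. It only sketches the idea that a string such as 'html.parser' is treated as a feature and matched against the registered builders.

```python
class ToyRegistry:
    """Simplified, hypothetical stand-in for a tree-builder registry."""

    def __init__(self):
        self._builders = []  # list of (name, set_of_features)

    def register(self, name, features):
        # Later registrations are checked first, so they take priority
        self._builders.insert(0, (name, set(features)))

    def lookup(self, *features):
        # Return the first builder that supports every requested feature
        wanted = set(features)
        for name, supported in self._builders:
            if wanted <= supported:
                return name
        return None


def pick_builder(registry, features=None, default=("html",)):
    # A single string is wrapped in a list, as in the excerpt above
    if isinstance(features, str):
        features = [features]
    # No features given: fall back to the (assumed) default
    if not features:
        features = list(default)
    name = registry.lookup(*features)
    if name is None:
        raise LookupError(
            "Couldn't find a tree builder with the features you "
            "requested: %s." % ",".join(features))
    return name


registry = ToyRegistry()
registry.register("html.parser", ["html.parser", "html"])
registry.register("lxml", ["lxml", "html", "xml"])

print(pick_builder(registry, "html.parser"))  # matched by parser name
print(pick_builder(registry, "html"))         # matched by markup type
```

In this toy model, asking for a markup type lets the registry choose among several builders, while asking for a parser name pins the choice to one of them.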

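The first argument is the markup (for example, HTML code), and the second argument specifies how to parse that markup. If you omit it, BeautifulSoup falls back to its default features and picks the best HTML parser that is installed (lxml, then html5lib, then Python's built-in html.parser), but you can override that: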

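  You can override this by specifying one of the following:

  • The type of markup you want to parse. Currently supported are "html", "xml", and "html5".
  • The name of the parser library you want to use. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser).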
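
As a rough illustration of passing a parser name, the sketch below parses the same markup with each of the three names from the list. It assumes the bs4 package is installed; the sample markup and the first_paragraph_text helper are made up for this example. lxml and html5lib are separate third-party packages, so the helper returns None for any parser that is missing instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

source_code = "<html><body><p class='greeting'>Hello</p></body></html>"


def first_paragraph_text(markup, parser_name):
    """Parse markup with the named parser and return the first <p> text.

    Returns None when that parser library is not installed.
    """
    try:
        soup = BeautifulSoup(markup, parser_name)
    except FeatureNotFound:
        # Raised when no installed tree builder matches the requested name
        return None
    return soup.p.get_text()


for name in ("html.parser", "lxml", "html5lib"):
    print(name, "->", first_paragraph_text(source_code, name))
```

Naming the parser explicitly, as in the question's BeautifulSoup(source_code, 'html.parser'), keeps the result the same from one machine to the next, because the choice no longer depends on which optional parsers happen to be installed.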