lxml etree error when using a custom parser target

Time: 2019-12-06 14:41:29

Tags: python html beautifulsoup html-parsing lxml

I am trying to parse a page with lxml using a custom parser target that stores specific elements in a list and returns the rest.

However, I get a strange error on http://yahoo.com.

  File "src\lxml\etree.pyx", line 3426, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1861, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
  File "src\lxml\parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
  File "src\lxml\parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src\lxml\parsertarget.pxi", line 172, in lxml.etree._TargetParserContext._handleParseResultDoc
  File "src\lxml\parsertarget.pxi", line 160, in lxml.etree._TargetParserContext._handleParseResultDoc
  File "src\lxml\etree.pyx", line 316, in lxml.etree._ExceptionContext._raise_if_stored
  File "src\lxml\saxparser.pxi", line 587, in lxml.etree._handleSaxTargetComment
  File "src\lxml\parsertarget.pxi", line 97, in lxml.etree._PythonSaxParserTarget._handleSaxComment
  File "src\lxml\saxparser.pxi", line 767, in lxml.etree.TreeBuilder.comment
  File "src\lxml\saxparser.pxi", line 714, in lxml.etree.TreeBuilder._handleSaxComment
  File "src\lxml\etree.pyx", line 3017, in lxml.etree.Comment
ValueError: Comment may not contain '--' or end with '-'

Here is the target code; it is then passed to lxml.etree.HTMLParser as usual to parse the page data.


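# Imports presumably used by the original snippet (not shown in the question).
# TokenFactory and default_factory are the asker's own helpers, also not shown.
from lxml.etree import TreeBuilder, _Element


class TokenProcessor(TreeBuilder):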
    """
    Processes the tag-elements and builds the tree.
    Stores the events generated by the :meth:`process_child` method.
    Must be subclassed for customisation of event generation.

    :param factory: an element factory that creates tokens from
        arguments.
    """

    def __init__(self, factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.events = list()
        self.factory = default_factory()
        if factory:
            self.set_factory(factory)

    def set_factory(self, factory):
        if not isinstance(factory, TokenFactory):
            raise TypeError()
        self.factory = factory

    def end(self, tag):
        """Handles closing of child element tag.
        It processes the generated child before returning it to
        the document tree.

        :param tag: tag name which is to be closed.
        :return: generated child element.
        """
        #: The original html element
        child = super().end(tag)

        if isinstance(child, _Element):
            #: Events generated from the element
            for pack in self.__process_child(child):
                #: Event token consisting of refs to parent builder and factory
                # event = self.factory.create(
                #     tag, pack[0], pack[1], pack[2], pack[3],
                #     weakref.ref(self), weakref.ref(self.factory)
                # )
                event = self.factory.from_tuple(tag, pack)
                if event is not None:
                    self.events.append(event)
        #: Original element is returned
        return child
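
The last line of the traceback states the rule being violated: a comment may not contain `--` or end with `-`. Comments like that are common in real-world HTML. One possible workaround, sketched here under assumptions, is to rewrite the comment text before it reaches `TreeBuilder.comment`. The names `sanitize_comment` and `SafeCommentBuilder` are hypothetical and are not part of lxml or of the code in the question. The rewrite is lossy (runs of hyphens are collapsed), and the lxml import is guarded only so that the helper can be tried on its own.

```python
import re

try:
    from lxml import etree  # third-party; only needed for the subclass below
except ImportError:
    etree = None


def sanitize_comment(text):
    """Return comment text with no '--' inside and no trailing '-'.

    Hypothetical helper: runs of hyphens are collapsed to a single '-',
    then any hyphen left at the end is stripped. This changes the text.
    """
    if text is None:
        return text
    collapsed = re.sub(r"-{2,}", "-", text)
    return collapsed.rstrip("-")


if etree is not None:

    class SafeCommentBuilder(etree.TreeBuilder):
        """Hypothetical TreeBuilder subclass that cleans comments first."""

        def comment(self, text):
            # Delegate to the stock implementation with sanitized text.
            return super().comment(sanitize_comment(text))
```

In the setup from the question, the same `comment` override could probably be added to `TokenProcessor` itself, with the target still passed in the usual way, for example `lxml.etree.HTMLParser(target=TokenProcessor())`. This has not been checked against the page mentioned in the question.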

0 answers:

No answers