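I am trying to parse a page with lxml using a custom parser target that stores specific elements in a list and returns the rest.
However, I am running into a strange error on http://yahoo.com: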
File "src\lxml\etree.pyx", line 3426, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1861, in lxml.etree._parseDocument
File "src\lxml\parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src\lxml\parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src\lxml\parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src\lxml\parsertarget.pxi", line 172, in lxml.etree._TargetParserContext._handleParseResultDoc
File "src\lxml\parsertarget.pxi", line 160, in lxml.etree._TargetParserContext._handleParseResultDoc
File "src\lxml\etree.pyx", line 316, in lxml.etree._ExceptionContext._raise_if_stored
File "src\lxml\saxparser.pxi", line 587, in lxml.etree._handleSaxTargetComment
File "src\lxml\parsertarget.pxi", line 97, in lxml.etree._PythonSaxParserTarget._handleSaxComment
File "src\lxml\saxparser.pxi", line 767, in lxml.etree.TreeBuilder.comment
File "src\lxml\saxparser.pxi", line 714, in lxml.etree.TreeBuilder._handleSaxComment
File "src\lxml\etree.pyx", line 3017, in lxml.etree.Comment
ValueError: Comment may not contain '--' or end with '-'
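As far as I can tell, the last frame can be triggered on its own, without fetching any page. The sketch below calls the `lxml.etree.Comment` factory directly (the same function that `TreeBuilder.comment` ends up in according to the traceback). The helper name `try_comment` and the sample strings are made up for illustration, and it assumes an lxml version that performs this check, as the one in the traceback does.

```python
from lxml import etree


def try_comment(text):
    """Return the ValueError message raised by etree.Comment, or None if accepted."""
    try:
        etree.Comment(text)
    except ValueError as exc:
        return str(exc)
    return None


# Hypothetical sample strings, not taken from the real page.
accepted = try_comment(" plain comment ")
double_dash = try_comment(" a -- b ")
trailing_dash = try_comment("ends with -")

print(accepted, double_dash, trailing_dash, sep="\n")
```

So my guess is that the page contains an HTML comment with `--` inside it (or one ending in `-`), which the HTML parser tolerates but the comment factory rejects.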
Here is the target code; it is then passed to lxml.etree.HTMLParser as usual to parse the page data.
# TokenFactory and default_factory are my own helpers, defined elsewhere.
from lxml.etree import TreeBuilder, _Element


class TokenProcessor(TreeBuilder):
    """
    Processes the tag-elements and builds the tree.
    Stores the events generated by the :meth:`process_child` method.
    Must be subclassed for customisation of event generation.

    :param factory: an element factory which creates tokens from
        arguments.
    """

    def __init__(self, factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.events = list()
        self.factory = default_factory()
        if factory:
            self.set_factory(factory)

    def set_factory(self, factory):
        if not isinstance(factory, TokenFactory):
            raise TypeError()
        self.factory = factory

    def end(self, tag):
        """Handles closing of child element tag.

        It processes the generated child before returning it to
        the document tree.

        :param tag: tag name which is to be closed.
        :return: generated child element.
        """
        #: The original html element
        child = super().end(tag)
        if isinstance(child, _Element):
            #: Events generated from the element
            for pack in self.__process_child(child):
                #: Event token consisting refs to parent builder and factory
                # event = self.factory.create(
                #     tag, pack[0], pack[1], pack[2], pack[3],
                #     weakref.ref(self), weakref.ref(self.factory)
                # )
                event = self.factory.from_tuple(tag, pack)
                if event is not None:
                    self.events.append(event)
        #: Original element is returned
        return child
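For context, this is roughly how such a target gets attached to the parser. It is a stripped-down sketch, not my real code: `LinkCollector` and the `SAMPLE` markup are hypothetical stand-ins for `TokenProcessor` and the fetched page, and the token factory is left out entirely.

```python
from lxml import etree


class LinkCollector(etree.TreeBuilder):
    """Hypothetical stand-in for TokenProcessor: remembers <a> elements."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []

    def end(self, tag):
        # Let TreeBuilder close the element, then record it if it is a link.
        element = super().end(tag)
        if tag == "a":
            self.links.append(element)
        # The element is still returned so the tree is built as usual.
        return element


# Made-up markup standing in for the downloaded page.
SAMPLE = "<html><body><p>hello</p><a href='/x'>link</a></body></html>"

target = LinkCollector()
parser = etree.HTMLParser(target=target)
# With a target parser, fromstring() returns whatever target.close() returns,
# which for a TreeBuilder is the root element.
root = etree.fromstring(SAMPLE, parser)

print(root.tag, [el.get("href") for el in target.links])
```

With the real page, the same call path is where the ValueError surfaces, since the parser also forwards comments to the target's `comment()` method.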