Question

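Can anyone give an example of how to use http://code.google.com/p/streamhtmlparser to parse all the A tag hrefs out of an HTML document? (Either C++ or Python code is fine, but I would prefer an example using the Python bindings.)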

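I can see how it works in the Python tests, but they expect special tokens to already be present in the HTML, and they check the state values at those points. I don't see how to get proper callbacks on state changes when feeding the parser plain HTML.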

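I can get some of the information I am looking for with the following code, but I need to feed it blocks of HTML rather than one character at a time, and I need to know when it has finished a tag, attribute, etc., not just that it is currently inside a tag, attribute, or value.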

import py_streamhtmlparser
parser = py_streamhtmlparser.HtmlParser()
html = """<html><body><a href='http://google.com'>link</a></body></html>"""
for index, character in enumerate(html):
   parser.Parse(character)
   print index, character, parser.Tag(), parser.Attribute(), parser.Value(), parser.ValueIndex()

You can see a sample run of this code here.

Answer 1

import py_streamhtmlparser
parser = py_streamhtmlparser.HtmlParser()
html = """<html><body><a href='http://google.com' id=100>
        link</a><p><a href=heise.de/></body></html>"""
cur_attr = cur_value = None
for index, character in enumerate(html):
   parser.Parse(character)
   if parser.State() == py_streamhtmlparser.HTML_STATE_VALUE:
      # we are in an attribute value. Record what we got so far
      cur_tag = parser.Tag()
      cur_attr = parser.Attribute()
      cur_value = parser.Value()
      continue
   if cur_value:
      # we are not in the value anymore, but have seen one just before
      print "%r %r %r" % (cur_tag, cur_attr, cur_value)
      cur_value = None

This gives:

'a' 'href' 'http://google.com'
'a' 'id' '100'
'a' 'href' 'heise.de/'

If you only want the href attributes, also check cur_attr at the point of the print.
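
A minimal sketch of that check, arranged so that HTML can be supplied in blocks (as the question asks) while the parser still sees one character at a time. It assumes only the calls already used above (Parse, State, Tag, Attribute, Value, and the HTML_STATE_VALUE constant). The names iter_attributes and iter_hrefs are hypothetical helpers, not part of the bindings, and the parser object and the value-state constant are passed in as arguments. Unlike the loop above, it also reports empty values and a value that is still open when the input ends.

```python
def iter_attributes(parser, chunks, value_state):
    """Yield (tag, attribute, value) each time the parser leaves a value.

    parser      -- assumed to behave like py_streamhtmlparser.HtmlParser()
    chunks      -- any iterable of strings (blocks of HTML)
    value_state -- assumed to be py_streamhtmlparser.HTML_STATE_VALUE
    """
    pending = None
    for chunk in chunks:
        for character in chunk:
            parser.Parse(character)
            if parser.State() == value_state:
                # Still inside an attribute value: remember what we have so far.
                pending = (parser.Tag(), parser.Attribute(), parser.Value())
            elif pending is not None:
                # First character after the value: the value is complete.
                yield pending
                pending = None
    if pending is not None:
        # The input ended while the parser was still inside a value.
        yield pending


def iter_hrefs(parser, chunks, value_state):
    """Yield only the href values that belong to A tags."""
    for tag, attribute, value in iter_attributes(parser, chunks, value_state):
        if tag and attribute and tag.lower() == "a" and attribute.lower() == "href":
            yield value


# Intended use with the real bindings (not run here, since it needs the module):
#   import py_streamhtmlparser
#   parser = py_streamhtmlparser.HtmlParser()
#   for url in iter_hrefs(parser, blocks, py_streamhtmlparser.HTML_STATE_VALUE):
#       print(url)
```

Because both helpers are generators, the blocks can come from a file or a socket and each href is produced as soon as the parser moves past it.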

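Edit: The Python bindings currently do not support any kind of event callback, so the only output available is the parser state at the end of processing the respective input. To change that, htmlparser.c could be augmented with callback functions in exit_attr (etc.). However, that is not really what streamhtmlparser is meant for: it is intended for a templating engine, where you have markers in the source and process the input character by character.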
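
Short of patching htmlparser.c, callbacks can be approximated from Python by polling the state after every character and reporting when it changes. The following is only a sketch under that assumption: feed and on_leave are hypothetical names, and it relies on Tag(), Attribute() and Value() being callable in every state, as the loop in the question does.

```python
def feed(parser, chunks, on_leave):
    """Feed blocks of HTML and call on_leave whenever the state changes.

    on_leave(state, tag, attribute, value) receives a snapshot taken on the
    last character that was still processed in the state being left, because
    the parser may no longer report that data once it has moved on.
    """
    snapshot = None
    for chunk in chunks:
        for character in chunk:
            parser.Parse(character)
            state = parser.State()
            if snapshot is not None and state != snapshot[0]:
                on_leave(*snapshot)
            snapshot = (state, parser.Tag(), parser.Attribute(), parser.Value())
    # Note: the state that is current when the input ends is never reported.


# Intended use with the real bindings (not run here, since it needs the module):
#   import py_streamhtmlparser
#   def on_leave(state, tag, attribute, value):
#       if state == py_streamhtmlparser.HTML_STATE_VALUE:
#           print("%r %r %r" % (tag, attribute, value))
#   feed(py_streamhtmlparser.HtmlParser(), blocks, on_leave)
```

This is polling rather than a true callback from the C library, so it costs several Python calls per input character.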

Example of using streamhtmlparser