Question

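Is there a Python library that will let me get an arbitrary piece of HTML without the markup being tampered with? As far as I know, lxml, BeautifulSoup and pyquery all make it easy to do something like soup.find(".arbitrary-class"), but the HTML they return has been reformatted. I want the raw, original markup.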

For example, say I have this:

<html> <head> <title>test</title> </head> <body> <div class="arbitrary-class"> This is some<br /> markup with <br> <p>some potentially problematic</p> stuff in it <input type="text" name="w00t"> </div> </body> </html>

I want to capture exactly:

" This is some<br /> markup with <br> <p>some potentially problematic</p> stuff in it <input type="text" name="w00t"> "

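...whitespace and all, and without the tags being rewritten into well-formed markup (e.g. <br />).

The trouble is that all 3 libraries seem to build a DOM internally and simply return a Python object representing what the file should be rather than what it is, so I don't know where or how to get the original snippet I need.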

Answer 1

This code:

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# .contents lists the child nodes of the first matching element
print(soup.select(".arbitrary-class")[0].contents)

returns the list (shown as originally posted, with Python 2's u'' string prefixes):

[u'\n      This is some', <br/>, u'\n      markup with ', <br/>, u'\n', <p>some potentially problematic</p>, u'\n      stuff in it ', <input name="w00t" type="text"/>, u'\n']

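Edit:

Daniel pointed out in the comments that this produces normalized tags.

The only alternative I could find is to use a parser generator such as pyparsing. The code below is a slightly modified version of part of the example code for the withAttribute function.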

from pyparsing import SkipTo, makeHTMLTags, withClass

html = """<html>
<head>
    <title>test</title>
</head>
<body>
    <div class="arbitrary-class">
    This is some<br />
    markup with <br>
    <p>some potentially problematic</p>
    stuff in it <input type="text" name="w00t">
    </div>
</body>
</html>"""

div,div_end = makeHTMLTags("div")

# only match div tag having a class attribute with value "arbitrary-class"
div_grid = div().setParseAction(withClass("arbitrary-class"))
grid_expr = div_grid + SkipTo(div | div_end)("body")
for grid_header in grid_expr.searchString(html):
    print(repr(grid_header.body))

The output of this code is:

'\n    This is some<br />\n    markup with <br>\n    <p>some potentially problematic</p>\n    stuff in it <input type="text" name="w00t">'

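Note that the first <br /> now keeps its space, and the <input> tag no longer has a / added before the closing >. The only difference from your specification is the missing trailing whitespace. You can resolve that difference by refining this solution.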
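One possible refinement, sketched here with the standard library instead of pyparsing: use html.parser only to locate where the matching <div> opens and closes, then slice the original string between those two offsets so that nothing is re-serialized. The names RawInnerHTML and raw_inner_html are made up for this illustration, and the sketch assumes the whole document is available as a single string and that the target element is a <div>.

```python
from html.parser import HTMLParser


class RawInnerHTML(HTMLParser):
    """Collect the untouched source text inside <div> elements with a given class.

    Hypothetical helper for illustration; it only tracks <div> nesting.
    """

    def __init__(self, source, class_name):
        super().__init__()
        self.source = source
        self.class_name = class_name
        # Absolute offset at which each line starts. Only "\n" is counted,
        # because that is how the parser itself counts lines for getpos().
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.start = None    # offset just past the matching opening tag
        self.depth = 0       # <div> nesting depth inside the current match
        self.fragments = []  # raw inner markup of every match

    def _offset(self):
        # getpos() reports (1-based line, 0-based column) of the current tag.
        lineno, col = self.getpos()
        return self.line_starts[lineno - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.start is not None:
            self.depth += 1  # nested <div> inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            # The inner markup begins right after the raw text of the open tag.
            self.start = self._offset() + len(self.get_starttag_text())
            self.depth = 1

    def handle_endtag(self, tag):
        if tag != "div" or self.start is None:
            return
        self.depth -= 1
        if self.depth == 0:
            # Slice the original string, so the markup is never rewritten.
            self.fragments.append(self.source[self.start:self._offset()])
            self.start = None


def raw_inner_html(source, class_name):
    """Return the raw inner markup of each <div> carrying class_name."""
    parser = RawInnerHTML(source, class_name)
    parser.feed(source)
    parser.close()
    return parser.fragments


html = (
    '<html> <head> <title>test</title> </head> <body> '
    '<div class="arbitrary-class"> This is some<br /> markup with <br> '
    '<p>some potentially problematic</p> stuff in it '
    '<input type="text" name="w00t"> </div> </body> </html>'
)

for fragment in raw_inner_html(html, "arbitrary-class"):
    print(repr(fragment))
```

Because the result is a slice of the input, the leading and trailing whitespace and the original <br /> and <br> spellings should survive as written. Badly broken markup (for example an unclosed <div>) would need extra handling that this sketch does not attempt.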

How to capture HTML, untouched by the capturing library?
