Avoid the outer wrapper element in lxml

Time: 2016-07-20 01:42:01

Tags: python html parsing lxml

>>> from lxml import html
>>> html.tostring(html.fromstring('<div>1</div><div>2</div>'))
'<div><div>1</div><div>2</div></div>'   # I don't want the outer <div>
>>> html.tostring(html.fromstring('I am pure text'))
'<p>I am pure text</p>'  # I don't need the extra <p>

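How can I avoid the outer <div> and <p> that lxml adds?

1 answer:

Answer 0: (score: 2)

By default, html.fromstring returns a single element, so lxml creates a parent <div> when the string contains multiple elements and wraps plain text in a <p>.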

You can parse the input into individual fragments with html.fragments_fromstring instead, and serialize each fragment yourself:

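from lxml import html

test_cases = ['<div>1</div><div>2</div>', 'I am pure text']
for test_case in test_cases:
    # Returns a list: an optional leading text string, then the elements.
    fragments = html.fragments_fromstring(test_case)
    print(fragments)
    output = ''
    for fragment in fragments:
        if isinstance(fragment, str):
            # Leading text comes back as a plain string, not an element.
            output += fragment
        else:
            # tostring returns bytes here, so decode before concatenating.
            output += html.tostring(fragment).decode('UTF-8')
    print(output)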