Question

>>> from lxml import html
>>> html.tostring(html.fromstring('<div>1</div><div>2</div>'))
'<div><div>1</div><div>2</div></div>'   # I don't want the outer <div>
>>> html.tostring(html.fromstring('I am pure text'))
'<p>I am pure text</p>'  # I don't need the extra <p>

How can I avoid the outer <div> and <p> that lxml adds?

Answer 1

By default, html.fromstring creates a parent <div> when the string contains multiple elements, and wraps bare text in a <p>.

You can parse the input as individual fragments instead:

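from lxml import html

test_cases = ['<div>1</div><div>2</div>', 'I am pure text']
for test_case in test_cases:
    # fragments_fromstring returns a list of elements; leading text, if any,
    # comes back as a plain string in the first position.
    fragments = html.fragments_fromstring(test_case)
    print(fragments)
    output = ''
    for fragment in fragments:
        if isinstance(fragment, str):
            output += fragment
        else:
            # tostring returns bytes by default, so decode before concatenating
            output += html.tostring(fragment).decode('UTF-8')
    print(output)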
