Question

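I am trying to walk through an HTML string and concatenate its text content, where the string used to join the pieces depends on the type of HTML tag encountered.

Example HTML: html_str='<td>This is how we parse our string together</td>'

I wrote a helper function called smart_itertext() that walks an HTML element e via e.iter(). For each element in e.iter(), it checks the tag and then appends the .text or .tail content.

My challenge is getting the tail text to show up in the right place. When I iterate tag by tag, the moment I reach an element seems to be my only chance to also access its tail text.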
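To make the ordering problem concrete, here is a minimal sketch. The markup is hypothetical, chosen only to show where .tail belongs, and it uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that lxml elements expose the same .text, .tail, iter() and itertext() behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not the markup from the question.
root = ET.fromstring('<td>one<b>two<i>three</i>four</b>five</td>')

# iter() visits elements in the order of their start tags: td, b, i.
visited = [(el.tag, el.text, el.tail) for el in root.iter()]

# Appending .text and .tail at visit time emits b's tail ("five")
# before the text nested inside b ("three", "four").
naive = ''.join((el.text or '') + (el.tail or '') for el in root.iter())

# itertext() yields the same strings in document order, because each
# tail is emitted only after the whole subtree of its element.
ordered = ''.join(root.itertext())

print(visited)
print(naive)
print(ordered)
```

An element's tail is the text that follows its closing tag, so it belongs after all of that element's descendants, not right after the element's own .text.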

Desired result:

>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how::we::parse::our string::together'

Actual result:

>>> smart_itertext(lxml.html.fromstring(html_str))
'This is how:: together::::we::parse::::our string'

Here is my function:

def smart_itertext(tree, cross_joiner='::'):
    empty_join = ['strong', 'b', 'em', 'i', 'small', 'marked', 'deleted',
                  'ins', 'sub', 'sup']
    cross_join = ['td', 'tr', 'br', 'p']
    output = ''
    for element in tree.iter():
        if element.tag in empty_join:
            if element.text:
                output += element.text
            if element.tail:
                output += element.tail
        elif element.tag in cross_join:
            if element.text:
                output += cross_joiner + element.text
            else:
                output += cross_joiner
            if element.tail:
                output += cross_joiner + element.tail
        else:
            print('unknown tag in smart_itertext:', element.tag)
    return output

What is the proper way to achieve this?
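One possible direction, offered as a sketch rather than a definitive answer, is to recurse over the children so that each tail is emitted only after its element's subtree has been handled. In the code below, the helper walk(), the boundary marker None, and the <br/> and <b> tags in the sample markup are all assumptions for illustration, since the inline tags of the original html_str are not shown. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; the same traversal should apply to elements returned by lxml.html.fromstring(), assuming they iterate over their children the same way.

```python
import xml.etree.ElementTree as ET

CROSS_JOIN = {'td', 'tr', 'br', 'p'}


def walk(element, cross_join=CROSS_JOIN):
    """Yield text pieces, and None at cross-join boundaries, in document order."""
    is_boundary = element.tag in cross_join
    if is_boundary:
        yield None                      # boundary before the element's content
    if element.text:
        yield element.text
    for child in element:
        yield from walk(child, cross_join)
        if child.tail:
            yield child.tail            # tail comes after the child's subtree
    if is_boundary:
        yield None                      # boundary after the element's content


def smart_itertext(tree, cross_joiner='::'):
    parts = []
    pending = False
    for piece in walk(tree):
        if piece is None:
            # Collapse repeated boundaries; never start with a joiner.
            pending = pending or bool(parts)
        else:
            if pending:
                parts.append(cross_joiner)
                pending = False
            parts.append(piece)
    return ''.join(parts)


# Hypothetical markup: the tags inside <td> are assumed for illustration.
sample = ('<td>This is how<br/>we<br/><b>parse</b><br/>'
          'our string<br/>together</td>')
result = smart_itertext(ET.fromstring(sample))
print(result)
```

Tags outside CROSS_JOIN are joined with no separator here, which differs from the original function's habit of printing a warning for unknown tags. Whitespace-only text is kept as-is.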

python lxml.html: proper way to iterate text with .tail in document string order

0 answers: