Question

I'm using the following to get all of the HTML content of a section so that I can save it to a database:

el = doc.get_element_by_id('productDescription')
lxml.html.tostring(el)

The product description markup looks like this:

<div id='productDescription'>

     <THE HTML CODE I WANT>

</div>

The code works well and gives me all of the HTML, but how do I remove the outer layer, i.e. the opening <div id='productDescription'> tag and the closing </div> tag?

Answer 1

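You can convert each child to a string individually:

# el.text is None when there is no text before the first child
text = el.text or ''
# encoding='unicode' makes tostring() return str instead of bytes
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())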

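Or, in an even hackier way:

el.attrib.clear()
el.tag = '|||'
# serialize as str, without any text that follows the element
text = lxml.html.tostring(el, encoding='unicode', with_tail=False)
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
# slice off the placeholder start and end tags
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]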

Answer 2

If your productDescription div contains mixed text/element content, e.g.

<div id='productDescription'>
  the
  <b> html code </b>
  i want
</div>

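you can iterate over the result of xpath('node()') to get the content as a string:

s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):
        # text node: append it as-is
        s += node
    else:
        # element: leave out its tail, which node() returns as a separate text node
        s += lxml.html.tostring(node, with_tail=False, encoding='unicode')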

Answer 3

Here is a function that does what you want.

import lxml.etree

def strip_outer(xml):
    """
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML         http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd">
    ...   <mrow>
    ...     <msup>
    ...       <mi>x</mi>
    ...       <mn>2</mn>
    ...     </msup>
    ...     <mo> + </mo>
    ...     <mi>x</mi>
    ...   </mrow>
    ... </math>'''
    >>> so = strip_outer(xml)
    >>> so.splitlines()[0]=='<mrow>'
    True

    """
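    # lxml fails with the xmlns= attribute, so rename it to a prefixed declaration
    xml = xml.replace('xmlns=', 'xmlns:x=')
    # ...and it can't strip the root element, so wrap everything in a temporary <root>
    xml = '<root>\n' + xml + '\n</root>'
    rx = lxml.etree.XML(xml)
    # strip the <math> tag with all its attributes, keeping its children
    lxml.etree.strip_tags(rx, 'math')
    uc = lxml.etree.tostring(rx, encoding='unicode')
    # drop the first and last lines, i.e. the temporary <root> tags
    uc = '\n'.join(uc.splitlines()[1:-1])
    return uc.strip()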

Answer 4

Use a regular expression.

dataframe['DSFS'].apply(replaceMonth(dataframe['DSFS']))
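For the question's case, a regular expression could be applied to the string returned by lxml.html.tostring. The following is only a sketch under assumptions: the pattern and the name strip_outer_tag are made up for illustration, the input is assumed to be a single serialized element, and attribute values are assumed not to contain a > character. It does not check that the first opening tag and the last closing tag belong to the same element.

```python
import re

# One outer element: opening tag, inner markup, closing tag with the same name.
OUTER_TAG = re.compile(
    r"^\s*<([A-Za-z][\w:-]*)\b[^>]*>(.*)</\1\s*>\s*$",
    re.DOTALL,
)


def strip_outer_tag(html):
    """Return the markup between the outermost start tag and end tag."""
    match = OUTER_TAG.match(html)
    if match is None:
        raise ValueError("input is not wrapped in a single outer element")
    return match.group(2)


serialized = "<div id='productDescription'>\n  the <b> html code </b> i want\n</div>"
print(strip_outer_tag(serialized))
```

Because regular expressions do not parse nested markup, the parser-based approaches in the other answers are usually the safer choice.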

Python, lxml, and removing the outer tag when using lxml.html.tostring(el)
