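How do I parse attribute names in an HTML file with lxml?

Asked: 2014-02-14 17:03:30

Tags: html parsing replace lxml

I have an HTML file in which I need to replace the value of the class attribute of each p element, depending on the text of its dfn child. The file looks like this: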

<html>
  <head></head>
  <body>
    <p class ='person'><dfn>New-York</dfn>
    <p class = 'place'><dfn>John Doe</dfn>
  </body>
</html>

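I want to parse this document and replace every class attribute value with the correct one. My script already has a set of conditions that decide whether a dfn text is a place or a person. So I would like to get the same HTML file as output, but with the correct classes: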

<html>
  <head></head>
  <body>
    <p class ='**place**'><dfn>New-York</dfn>
    <p class = '**person**'><dfn>John Doe</dfn>
  </body>
</html>

So far I have tried to find the ancestor p of each dfn and its 'class' attribute, and then replace the value with the replace() function, but it didn't really work:

from lxml import etree

filename = open('file.html', 'r+')
tree = etree.parse(filename)

def f1():
  for dfn in tree.getiterator('dfn'):
    def_text = dfn.text
    # My real script has a list of conditions; 'New-York' is only an example.
    if def_text == 'New-York':
      class1 = ''.join(dfn.xpath('ancestor::p//@class'))
      filename.write(class1.replace('person', 'place'))

All I get is the same file, but with a line 'place' appended at the end...

1 Answer:

Answer 0: (score: 0)

Use lxml with XSLT to transform your HTML, for example:

from lxml import etree

h = '''<html>
  <head></head>
  <body>
    <p class ='person'><dfn>New-York</dfn></p>
    <p class = 'place'><dfn>John Doe</dfn></p>
  </body>
</html>'''
doc = etree.fromstring(h, etree.HTMLParser())

xsl = '''<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
 <xsl:template match="p">
    <xsl:variable name="original-class" select="string(@class)" />
    <xsl:copy>
        <xsl:apply-templates select="@*"/>
        <xsl:if test="dfn[text()='New-York']">
          <xsl:attribute name="class">
            <xsl:value-of select="concat('**', $original-class, '**')"/>
          </xsl:attribute>
        </xsl:if>
        <xsl:apply-templates select="node()"/>
    </xsl:copy>
   </xsl:template>
</xsl:stylesheet>'''
xslt_root = etree.XML(xsl)
transform = etree.XSLT(xslt_root)
result_tree = transform(doc)
print(result_tree)

Output:

$ python x.py 
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>
  <body>
    <p class="**person**"><dfn>New-York</dfn></p>
    <p class="place"><dfn>John Doe</dfn></p>
  </body>
</html>
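
Note that the stylesheet above wraps the original class in asterisks rather than swapping it. If the goal is simply to set each class from the dfn text, the attribute can also be changed directly on the parsed tree, without XSLT. The following is only a minimal sketch under stated assumptions: classify() and KNOWN_PLACES are hypothetical stand-ins for the conditions in the asker's real script, and the markup is the closed-tag sample from this answer.

```python
from lxml import etree

SOURCE = '''<html>
  <head></head>
  <body>
    <p class='person'><dfn>New-York</dfn></p>
    <p class='place'><dfn>John Doe</dfn></p>
  </body>
</html>'''

# Hypothetical lookup standing in for the asker's real conditions.
KNOWN_PLACES = {'New-York'}


def classify(text):
    """Return 'place' or 'person' for a dfn text (illustrative rule only)."""
    return 'place' if text in KNOWN_PLACES else 'person'


def fix_classes(markup):
    """Parse the markup and set class on the nearest p ancestor of each dfn."""
    doc = etree.fromstring(markup, etree.HTMLParser())
    for dfn in doc.iter('dfn'):
        # iterancestors('p') yields matching ancestors, nearest first.
        parent = next(dfn.iterancestors('p'), None)
        if parent is not None:
            # Element.set() overwrites the attribute in the tree itself.
            parent.set('class', classify((dfn.text or '').strip()))
    return doc


doc = fix_classes(SOURCE)
classes = [p.get('class') for p in doc.iter('p')]
print(etree.tostring(doc, method='html', encoding='unicode'))
```

The key difference from the attempt in the question is that the tree is modified and then serialized as a whole, instead of writing the replaced string to the open file handle, which only appends text after the existing content. To save the result, something like doc.getroottree().write('out.html', method='html', encoding='utf-8') should work, where 'out.html' is just a placeholder name.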