BeautifulSoup / lxml.html: remove a tag and its children if a child looks like x

Date: 2011-10-06 20:00:04

Tags: python beautifulsoup lxml

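I'm having a hard time finding a solution. I want to remove a <question> and all of its children if its <answer> is 99. As a result, I need a string containing only the filtered questions. I have the following HTML structure: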

<html>
 <body>        
  <questionaire>
   <question>
    <questiontext>
     Do I have a question?
    </questiontext>
    <answer>
     99
    </answer>
   </question>
   <question>
    <questiontext>
     Do I love HTML/XML parsing?
    </questiontext>
    <questalter>
     <choice>
      1 oh god yeah
     </choice>
     <choice>
      2 that makes me feel good
     </choice>
     <choice>
      3 oh hmm noo
     </choice>
     <choice>
      4 totally
     </choice>
     </questalter>
     <answer>
      4
    </answer>
   </question>
  </questionaire>
 </body>
</html>      

So far I have tried to implement it with XPath... but lxml.html has no iterparse... or does it? Thanks!

2 answers:

Answer 0: (score: 1)

This does exactly what you need:

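from xml.dom import minidom

# `text` holds the markup from the question; minidom requires well-formed XML
doc = minidom.parseString(text)
for question in doc.getElementsByTagName('question'):
    for answer in question.getElementsByTagName('answer'):
        # Compare the whitespace-stripped text of the answer's first child node
        if answer.childNodes[0].nodeValue.strip() == '99':
            question.parentNode.removeChild(question)
            break  # the question is gone; do not try to remove it twice

print(doc.toxml())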

Result:

<html>
 <body>        
  <questionaire>

   <question>
    <questiontext>
     Do I love HTML/XML parsing?
    </questiontext>
    <questalter>
     <choice>
      1 oh god yeah
     </choice>
     <choice>
      2 that makes me feel good
     </choice>
     <choice>
      3 oh hmm noo
     </choice>
     <choice>
      4 totally
     </choice>
     </questalter>
     <answer>
      4
    </answer>
   </question>
  </questionaire>
 </body>
</html>
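
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same minidom approach. The function name `drop_questions`, the `unwanted` parameter and the shortened sample markup are my own assumptions for illustration, not part of the question. Unlike the snippet above, it joins all text nodes of an <answer>, so an empty <answer/> should not raise.

```python
from xml.dom import minidom

# Shortened, well-formed sample (assumption: trimmed from the question's markup)
SAMPLE = (
    "<questionaire>"
    "<question><questiontext>Do I have a question?</questiontext>"
    "<answer> 99 </answer></question>"
    "<question><questiontext>Do I love HTML/XML parsing?</questiontext>"
    "<answer> 4 </answer></question>"
    "</questionaire>"
)


def drop_questions(xml_text, unwanted="99"):
    """Return xml_text without the <question> elements answered `unwanted`."""
    doc = minidom.parseString(xml_text)
    for question in doc.getElementsByTagName("question"):
        for answer in question.getElementsByTagName("answer"):
            # Join all text nodes so an empty <answer/> does not raise
            value = "".join(
                node.data for node in answer.childNodes
                if node.nodeType == node.TEXT_NODE
            ).strip()
            if value == unwanted:
                question.parentNode.removeChild(question)
                break  # already removed; do not try to remove it twice
    return doc.documentElement.toxml()


filtered = drop_questions(SAMPLE)
print(filtered)
```

Keep in mind that minidom only accepts well-formed XML, so real-world HTML with unclosed tags would need a more forgiving parser.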

Answer 1: (score: 1)

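from lxml import etree

# `html_string` holds the markup from the question
html = etree.fromstring(html_string)
questions = html.xpath('/html/body/questionaire/question')
for question in questions:
    for element in question:  # iterate over the direct children
        # Exact match on the stripped text, so '199' is not treated as '99'
        if element.tag == 'answer' and (element.text or '').strip() == '99':
            question.getparent().remove(question)
            break  # the question is gone; skip its remaining children
print(etree.tostring(html, encoding='unicode'))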
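
Since the question asks about lxml.html specifically, here is a rough sketch of how the same filter might look there, with the test moved into an XPath predicate so that no inner loop is needed. The shortened sample markup and the variable names are my own assumptions. For a document of this size, parsing the whole tree at once should be enough, so iterparse is not required.

```python
import lxml.html

# Shortened sample (assumption: trimmed from the question's markup)
SAMPLE = """
<html><body><questionaire>
 <question><questiontext>Do I have a question?</questiontext><answer> 99 </answer></question>
 <question><questiontext>Do I love HTML/XML parsing?</questiontext><answer> 4 </answer></question>
</questionaire></body></html>
"""

root = lxml.html.fromstring(SAMPLE)

# normalize-space() trims the whitespace around the answer text before comparing;
# with several <answer> children, only the first one is considered
for question in root.xpath("//question[normalize-space(answer) = '99']"):
    question.getparent().remove(question)

# Serialize the filtered tree back to a string
filtered = lxml.html.tostring(root, encoding="unicode")
print(filtered)
```

The HTML parser is lenient, so this variant should also cope with markup that is not well-formed XML, at the cost of the parser deciding how to repair it.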