How to split tags from an HTML tree

Date: 2012-01-09 12:19:00

Tags: python beautifulsoup lxml

Here is my HTML tree:

 <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>

From this HTML I need to extract the lines that come before the <br> tags:

Line 1: Get the IndianOil Citibank Card. Apply Now!

Line 2: Get 10X Rewards On Shopping - Save Over 5% On Fuel

How should this be done in Python?

4 Answers:

Answer 0 (score: 4):

I think you are simply asking for the line before each <br/>.

The following code will do that for the sample you provided, by stripping the <b> and <a> tags and then printing the .tail of each element whose following sibling is a <br/>:

from lxml import etree

doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now! 
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")

etree.strip_tags(doc, 'a', 'b')

# Each element followed (as a sibling) by a <br/>; its .tail is the text
# leading up to that <br/>.
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))

This yields:

'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\n   Save Over 5% On Fuel'
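If you want the second line collapsed onto a single line, as in the question, a minimal follow-up sketch (reusing the doc parsed above; the whitespace normalization is my addition, not part of the original answer) could be:

# Collapse runs of whitespace and newlines in each .tail into single spaces.
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(' '.join(element.tail.split()))

This prints 'Get 10X Rewards On Shopping - Save Over 5% On Fuel' for the second line.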

Answer 1 (score: 1):

A solution that does not rely on the <br> tags:

import lxml.html

html = "..."  # the <li> snippet from the question
tree = lxml.html.fromstring(html)
# The first three text nodes of the <li>, including the <b>Card</b> text,
# make up line 1; the two links without an id attribute make up line 2.
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))
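Note that the bare b/text() step in the first XPath is evaluated relative to the context node, so this relies on lxml.html.fromstring() returning the <li> itself as the root of the parsed fragment. A minimal usage sketch against the question's snippet, under that assumption:

import lxml.html

# The fragment from the question; with a single top-level element,
# lxml.html.fromstring() returns that <li> element itself.
html = '''<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>'''

tree = lxml.html.fromstring(html)
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))
print(line1)  # Get the IndianOil Citibank Card. Apply Now!
print(line2)  # Get 10X Rewards On Shopping - Save Over 5% On Fuel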

Answer 2 (score: 1):

As with all HTML parsing, you need to make some assumptions about how the HTML is formatted. If we can assume that the previous line is everything before the <br> tag, back to a block-level tag or another <br>, then we can do the following...

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 this would be "from bs4 import BeautifulSoup"

doc = """
   <li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
    </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
    <br />
    <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
    <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
    <br />
    <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""

soup = BeautifulSoup(doc)

Now that the HTML is parsed, we define the list of tags that should not be treated as part of a line. There are other block-level tags in reality, but this list works for this HTML.

block_tags = ["div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"]

We loop over every <br> tag and step backwards through its siblings until we either run out of them or hit a block-level tag. Each time around the loop we prepend the node to our line. NavigableStrings have no name attribute, but we still want to include them, hence the two-part test in the while loop.

for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        line = unicode(sibling) + line
        sibling = sibling.previousSibling
    print(line)
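Note that unicode(sibling) serializes any markup inside the sibling, so the assembled lines still contain tags such as <b> and <a>. One way to finish the job (my addition, not part of the original answer) is to re-parse each assembled line and keep only its text, collapsing the leftover whitespace:

for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        line = unicode(sibling) + line
        sibling = sibling.previousSibling
    # Re-parse the assembled fragment, keep only its text nodes, and
    # collapse runs of whitespace into single spaces.
    text = ''.join(BeautifulSoup(line).findAll(text=True))
    print(' '.join(text.split()))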

Answer 3 (score: 0):

I am not sure whether you want to use lxml or BeautifulSoup. But for lxml with XPath, here is an example:

from lxml import etree
import urllib2

response = urllib2.urlopen('your url here')
html = response.read()
imdb = etree.HTML(html)
titles = imdb.xpath('/html/body/li/a/text()')  # XPath for the "line 2" data (found with Firebug)

The XPath I used targets the HTML snippet you provided; it may need to change in the context of the original page.

You can also try cssselect in lxml:

import lxml.html
import urllib

data = urllib.urlopen('your url').read()
doc = lxml.html.fromstring(data)
elements = doc.cssselect('your csspath here')  # CSS path (found with the Firebug extension)
for element in elements:
    print(element.text_content())
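For the snippet from the question, a concrete sketch might look like the following; the CSS path li.taf a and the id filter are my own illustration, not taken from the original answer:

import lxml.html

html = '''<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
    Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
   </h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
   <br />
   <a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
   <a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
   <br />
   <cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>'''

doc = lxml.html.fromstring(html)
# Select every link inside the <li>, skipping the headline link (id="pa1").
for element in doc.cssselect('li.taf a'):
    if element.get('id') is None:
        print(element.text_content())  # the two "line 2" link texts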