lxml: How do I discard all <li> elements that contain a link with a particular class?

Asked: 2011-07-29 19:52:21

Tags: python lxml

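As usual, I'm struggling with the lack of proper lxml documentation (note to self: write a proper lxml tutorial and get lots of traffic!).

I want to find all <li> items that do not contain an <a> tag with a particular class.

For example: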

<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>

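I only want the <li> that does not contain a link with the class new, and from it I want the text inside <small>. In other words, 'pudding'.

Can anyone help?

Thanks!

2 answers:

Answer 0 (score: 2)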

import lxml.html as lh

content='''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''

tree=lh.fromstring(content)
for elt in tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()'):
    print(elt)

# pudding

The XPath has the following meaning:

//                        # starting from the document root, search all descendants
li[                       # select <li> elements that
    not(descendant::a[    # do not have a descendant <a>
        @class="new"])]   # whose class attribute is exactly "new"
    /small                # then take their <small> child elements
    /text()               # and return the text nodes of those elements
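
Note that @class="new" compares the whole attribute value, so a link with class="new featured" would not be treated as "new". The following is a minimal sketch of a token-based variant, assuming that "has the class new" should also cover multi-class attributes. The third <li> (bread) and the names EXACT and TOKEN are made up for illustration and are not part of the question.

```python
import lxml.html as lh

content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
<li><small>bread</small>: rye and <a href="/sourdough" class="new featured">sourdough</a></li>
</ul>
'''

# Exact attribute comparison, as in the XPath explained above.
EXACT = '//li[not(descendant::a[@class="new"])]/small/text()'

# Token-based comparison (an assumption about the intended meaning):
# pad the whitespace-normalized class list with spaces and look for " new ".
TOKEN = ('//li[not(descendant::a[contains('
         'concat(" ", normalize-space(@class), " "), " new ")])]/small/text()')

tree = lh.fromstring(content)
exact_matches = [str(t) for t in tree.xpath(EXACT)]
token_matches = [str(t) for t in tree.xpath(TOKEN)]

print(exact_matches)  # ['pudding', 'bread']
print(token_matches)  # ['pudding']
```

With the exact comparison the hypothetical bread item slips through, because its class attribute is "new featured" rather than "new"; the token-based expression rejects it.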

Answer 1 (score: 0)

Here is a quick hack:

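from lxml import etree
# Note: since lxml 3.0, lxml.cssselect needs the separate cssselect package.
from lxml.cssselect import CSSSelector

# Named "content" rather than "str" to avoid shadowing the built-in str type.
content = r"""
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>"""

html = etree.HTML(content)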

bad_sel = CSSSelector('li > a.new')
good_sel = CSSSelector('li > small')

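# <li> elements that directly contain an <a class="new"> link
bad = [item.getparent() for item in bad_sel(html)]
# <small> elements whose parent <li> is not in the rejected list
good = [item for item in good_sel(html) if item.getparent() not in bad]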

for item in good:
  print(item.text)

First build a list of the items you do not want, then build the list of items you do want by excluding the bad ones.
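
The same two-step idea can also be sketched without the CSS selector dependency, using only element iteration. This is an illustration rather than the answerer's code: the helper name small_texts_without_new_links is hypothetical, and it assumes that a "new" link anywhere below the <li> (not only a direct child) should disqualify it.

```python
import lxml.html as lh

content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''


def small_texts_without_new_links(root):
    """Hypothetical helper: <small> text of every <li> that has no a.new link."""
    # Step 1: collect the <li> elements to reject, i.e. those with a
    # descendant <a> whose class list contains the token "new".
    bad = [
        li for li in root.iter('li')
        if any('new' in a.get('class', '').split() for a in li.iter('a'))
    ]
    # Step 2: keep the <small> elements whose parent <li> was not rejected.
    return [
        small.text for small in root.iter('small')
        if small.getparent() not in bad
    ]


result = small_texts_without_new_links(lh.fromstring(content))
print(result)  # ['pudding']
```

For large documents the `not in bad` list lookup is linear per element; the single XPath query from the first answer avoids that extra pass.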