Question

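As usual, I'm struggling with the lack of proper lxml documentation (note to self: write a proper lxml tutorial and get lots of traffic!).

I want to find all <li> items that do not contain an <a> tag with a particular class.

For example: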

<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>

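I only want to grab the <li> that does not contain a link with class new, and I want to grab the text inside its <small>. In other words, 'pudding'.

Can anyone help?

Thanks!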

Answer 1

import lxml.html as lh

content='''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''

tree=lh.fromstring(content)
for elt in tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()'):
    print(elt)

# pudding

The XPath has the following meaning:

//                        # from the root node, look at all descendants
li[                       # select nodes of type <li> who
    not(descendant::a[    # do not have a descendant of type <a>
        @class="new"])]   # with a class="new" attribute 
    /small                # select the node of type <small>
    /text()               # return the text of that node
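
One caveat about the expression above: @class="new" compares the whole attribute value, so a link written as class="new external" would not match and its <li> would slip through. Below is a minimal sketch of a token-based test, assuming the common XPath 1.0 idiom built from concat() and normalize-space(). The two-class sample markup and the helper name texts_without_class are made up for illustration and are not part of the original answer.

```python
import lxml.html as lh

# Hypothetical sample: the second link carries two classes.
content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new external">St Marcellin</a></li>
</ul>
'''

def texts_without_class(html, cls):
    """Return <small> texts of <li> items with no descendant <a> of class `cls`."""
    tree = lh.fromstring(html)
    # Pad the class attribute with spaces so `cls` is matched as a whole token.
    # $cls is an XPath variable, supplied through the keyword argument below.
    return tree.xpath(
        '//li[not(descendant::a[contains('
        'concat(" ", normalize-space(@class), " "), concat(" ", $cls, " "))])]'
        '/small/text()',
        cls=cls,
    )

print(texts_without_class(content, "new"))
```

With this sample, the exact-match predicate [@class="new"] would let the cheese item through, while the token-based version treats "new" as one of several space-separated class names.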

Answer 2

I quickly hacked up this code:

from lxml import etree
from lxml.cssselect import CSSSelector

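# Named `content` rather than `str` to avoid shadowing the built-in type
content = r"""
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>"""

html = etree.HTML(content)

# <a class="new"> links that are direct children of an <li>
bad_sel = CSSSelector('li > a.new')
# <small> elements that are direct children of an <li>
good_sel = CSSSelector('li > small')

# The <li> parents of the unwanted links
bad = [item.getparent() for item in bad_sel(html)]
# Keep only the <small> elements whose parent <li> is not in the bad list
good = [item for item in good_sel(html) if item.getparent() not in bad]

for item in good:
    print(item.text)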

First build a list of the items you don't want, then build the list of items you do want by excluding the bad ones.
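
The same two-step idea can be sketched without the cssselect dependency, collecting the unwanted <li> parents in a set so the membership check does not rescan a list. This is only an illustration: it assumes, as in the sample markup, that the links and <small> elements are direct children of <li>, and the function name small_texts_excluding is invented here.

```python
from lxml import etree

content = """
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>"""

def small_texts_excluding(html, cls):
    """Return the text of each <small> whose parent <li> has no <a class=cls> child."""
    root = etree.HTML(html)
    # Step 1: collect the <li> parents of every unwanted link.
    bad = {a.getparent() for a in root.xpath('//li/a[@class=$cls]', cls=cls)}
    # Step 2: keep the <small> children whose parent <li> is not in that set.
    return [s.text for s in root.xpath('//li/small') if s.getparent() not in bad]

print(small_texts_excluding(content, "new"))
```

For a handful of list items the difference from a plain list is negligible; the set mainly matters on large pages with many excluded items.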

lxml: How to discard all <li> elements containing a link with a particular class?
