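As usual, I'm struggling with the lack of proper lxml documentation (note to self: I should write a proper lxml tutorial and get lots of traffic!).
I want to find all <li> items that do not contain an <a> tag with a particular class.
For example: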
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
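I only want to grab the <li> that does not contain a link with the class new, and from it I want the text inside <small>. In other words, 'pudding'.
Can anyone help?
Thanks!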
Answer 0 (score: 2)
import lxml.html as lh
content='''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''
tree=lh.fromstring(content)
for elt in tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()'):
    print(elt)
    # pudding
The XPath has the following meaning:
// # from the root node, look at all descendants
li[ # select nodes of type <li> who
not(descendant::a[ # do not have a descendant of type <a>
@class="new"])] # with a class="new" attribute
/small # select the node of type <small>
/text() # return the text of that node
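One caveat worth illustrating: `@class="new"` compares the whole attribute value, so it will not match an element such as `class="new external"`. The sketch below assumes a hypothetical variant of the markup in which the link carries a second class (`external` is invented for illustration) and compares the exact test with the common XPath 1.0 idiom for matching a single class token.

```python
import lxml.html as lh

# Hypothetical variant of the question's markup: the second link carries
# two classes ("external" is invented for this illustration).
content = '''
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new external">St Marcellin</a></li>
</ul>
'''
tree = lh.fromstring(content)

# Exact comparison: the class attribute must be precisely "new".
exact = '//li[not(descendant::a[@class="new"])]/small/text()'

# Token comparison: pad the normalized class list with spaces and look
# for " new " so that "new external" matches but "newest" does not.
token = ('//li[not(descendant::a[contains('
         'concat(" ", normalize-space(@class), " "), " new ")])]'
         '/small/text()')

exact_result = tree.xpath(exact)
token_result = tree.xpath(token)
print(exact_result)
print(token_result)
```

With this markup the exact test no longer excludes the second <li>, while the token test still does. If the real pages only ever use a single class per link, the simpler `@class="new"` form is enough.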
Answer 1 (score: 0)
Here is a quick hack:
from lxml import etree
from lxml.cssselect import CSSSelector
markup = r"""
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>"""
html = etree.HTML(markup)
# <a class="new"> elements that are direct children of an <li>
bad_sel = CSSSelector('li > a.new')
good_sel = CSSSelector('li > small')
bad = [item.getparent() for item in bad_sel(html)]
# Keep only the <small> elements whose parent <li> is not in the bad list
good = [item for item in good_sel(html) if item.getparent() not in bad]
for item in good:
    print(item.text)
First build a list of the items you don't want, then build the list of items you do want by excluding the bad ones.
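Note that the selector `li > a.new` only matches links that are direct children of the <li>. As a rough sketch of the same exclude-then-select idea that also handles links nested deeper inside the item, the function below uses only the lxml element API. The function name `small_texts_without_class` and the nested test markup are made up for this illustration and are not part of the original answers.

```python
import lxml.html as lh


def small_texts_without_class(markup, cls):
    """Return the <small> text of every <li> with no <a> descendant carrying cls."""
    tree = lh.fromstring(markup)

    # Step 1: collect the <li> elements we do not want, i.e. every <li>
    # ancestor of an <a> whose class list contains cls.
    bad = []
    for a in tree.iter('a'):
        if cls in a.get('class', '').split():
            bad.extend(a.iterancestors('li'))

    # Step 2: keep the <small> children of the remaining <li> elements.
    return [small.text
            for li in tree.iter('li') if li not in bad
            for small in li.findall('small')]


# Hypothetical markup: the second link is wrapped in a <span>.
sample = '''
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <span><a href="/st-marcellin" class="new">St Marcellin</a></span></li>
</ul>
'''
result = small_texts_without_class(sample, 'new')
print(result)
```

For the flat markup in the question this should behave the same as the selector-based version; the difference only shows up when the link is not a direct child of the <li>.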