New to lxml - selecting by XPath gives too many results

Date: 2014-11-03 01:52:08

Tags: python lxml

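The problem I am facing in the code below is probably a common one.

Basically, I want to run an XPath query against a child node of the page, but it gives me every match on the entire page.

What gives?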

import lxml.html

def readHTML(arg):
    ret = ""
    ret = lxml.html.parse(arg)
    return ret

soup = (readHTML("http://www.myScrapingSite.com/"))

subGroup =  soup.xpath("//div[@class='colmask']")[0]

# I want this to be only the cities in subGroup, but it's
# giving me the cities on the entire page. What am I doing wrong?
cities = subGroup.xpath('//li/a')
urls = {}

# So basically I am building a dictionary that is a superset of the desired set
for city in cities:
    print(city.attrib['href'])
    urls[city.attrib['href']] = 1

for url in urls:
    subGroup2 = readHTML(url)

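1 answer:

Answer 0 (score: 2)

The problem is that a leading // makes the expression absolute: it searches from the document root, even when you call xpath() on a sub-element such as subGroup. What you really want is probably .//, which searches relative to the current node: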

cities = subGroup.xpath('.//li/a')

Here is an example:

>>> import lxml.etree
>>> xmlString = '<root><taga name="a"><tagb name="first"/></taga><taga name="b"><tagb name="second"/></taga></root>'
>>> xml = lxml.etree.fromstring(xmlString)
>>> taga = xml.xpath('//taga[@name="a"]')[0]
>>> taga.xpath('//tagb')
[<Element tagb at 7fddaa625310>, <Element tagb at 7fddaa6252b8>]
>>> taga.xpath('.//tagb')
[<Element tagb at 7fddaa625310>]

You can see that // returns both tagb entries, while .// returns only the one inside the current node.
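
The same difference can be sketched on HTML that resembles the question's scenario. This is only an illustration under assumptions: the PAGE markup, the footer div, and the href values are made up, since the real site's structure is not shown. It also collects the links into a set, which may be a simpler way to drop duplicates than a dictionary of 1s.

```python
import lxml.html

# Hypothetical page: one list inside the target div, one outside it.
PAGE = """
<html><body>
  <div class="colmask">
    <ul>
      <li><a href="/austin">Austin</a></li>
      <li><a href="/boston">Boston</a></li>
    </ul>
  </div>
  <div class="footer">
    <ul>
      <li><a href="/about">About</a></li>
    </ul>
  </div>
</body></html>
"""

root = lxml.html.fromstring(PAGE)
sub_group = root.xpath("//div[@class='colmask']")[0]

# '//' searches from the document root, so the footer link is included.
all_hrefs = [a.get("href") for a in sub_group.xpath("//li/a")]

# './/' searches from sub_group, so only its descendants match.
scoped_hrefs = [a.get("href") for a in sub_group.xpath(".//li/a")]

# A set removes duplicate URLs.
urls = set(scoped_hrefs)

print(all_hrefs)     # includes "/about" from the footer
print(scoped_hrefs)  # only the links under the colmask div
```

Here a.get("href") is used instead of a.attrib['href'] so that an anchor without an href yields None rather than raising KeyError.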