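The problem I'm running into with the code below is probably a common one.
Basically, I want to run an XPath query against a child node of the page, but it gives me every match on the entire page.
What gives?
import lxml.html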
def readHTML(arg):
    # Parse the page at the given URL (or file path) into an lxml tree
    return lxml.html.parse(arg)
soup = (readHTML("http://www.myScrapingSite.com/"))
subGroup = soup.xpath("//div[@class='colmask']")[0]
# I want this to contain only the cities in subGroup, but it's
# giving me the cities on the entire page. What am I doing wrong?
cities = subGroup.xpath('//li/a')
urls = {}
#so basically I am building a dictionary that is a superset of the desired set
for city in cities:
    print(city.attrib['href'])
    urls[city.attrib['href']] = 1
for url in urls:
    subGroup2 = readHTML(url)
Answer 0 (score: 2)
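The problem is that a leading // means "search from the document root", even when you call xpath() on a sub-element. What you really want is probably .//, which searches relative to the current node: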
cities = subGroup.xpath('.//li/a')
Here is an example:
>>> import lxml.etree
>>> xmlString = '<root><taga name="a"><tagb name="first"/></taga><taga name="b"><tagb name="second"/></taga></root>'
>>> xml = lxml.etree.fromstring(xmlString)
>>> taga = xml.xpath('//taga[@name="a"]')[0]
>>> taga.xpath('//tagb')
[<Element tagb at 7fddaa625310>, <Element tagb at 7fddaa6252b8>]
>>> taga.xpath('.//tagb')
[<Element tagb at 7fddaa625310>]
As you can see, // returns both tagb entries, while .// returns only the one inside the current node.
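To tie this back to the scraping code in the question, here is a minimal sketch of the same comparison on HTML. The page markup, the footer div, and the href values are made up for illustration; only the colmask class and the li/a path come from the question.

```python
import lxml.html

# Hypothetical page: one link inside the target div, one outside it
PAGE = """
<html><body>
  <div class="colmask"><ul><li><a href="/boston">Boston</a></li></ul></div>
  <div class="footer"><ul><li><a href="/about">About</a></li></ul></div>
</body></html>
"""

doc = lxml.html.fromstring(PAGE)
sub_group = doc.xpath("//div[@class='colmask']")[0]

# Absolute path: searches from the document root, ignoring sub_group
all_hrefs = [a.get("href") for a in sub_group.xpath("//li/a")]

# Relative path: searches only the descendants of sub_group
sub_hrefs = [a.get("href") for a in sub_group.xpath(".//li/a")]

print(all_hrefs)  # links from the whole page
print(sub_hrefs)  # links from the colmask div only
```

With the relative form, the list already contains only the links you care about, so there is no need to build a superset and filter it afterwards.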