br.open('http://www.google.com/advanced_search')
br.select_form(name='f')
br.form['as_q'] = "lxml"
data = br.submit()
html_string = data.read() //this is my input
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html_string), parser)
follow_urls = tree.xpath('//*[@id="nav"]/tbody/tr/td/a')
使用上面的代码从谷歌搜索结果中获取后续链接。但它返回空。
但是当我在控制台中做同样的事情时,我会得到链接
出了什么问题?
答案 0 :(得分:0)
你真的在HTML字符串中有table/tbody/tr
吗?
tbody
通常由您的浏览器插入,您可以在检查窗口中看到它。
您可以尝试:
tree.xpath('//*[@id="nav"]/tr/td/a')
或涵盖两种情况:
tree.xpath('(//*[@id="nav"]/tbody/tr | //*[@id="nav"]/tr)/td/a')
示例python shell会话:
>>> import pprint
>>> import lxml.html
>>> root = lxml.html.parse('http://www.google.fr/search?q=lxml').getroot()
>>> pprint.pprint(root.xpath('(//*[@id="nav"]/tbody/tr | //*[@id="nav"]/tr)/td/a/@href'))
['/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=10&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=20&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=30&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=40&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=50&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=60&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=70&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=80&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=90&sa=N',
'/search?q=lxml&ie=UTF-8&prmd=ivns&ei=UHsxU-6KK8nxhQfinYG4Ag&start=10&sa=N']
>>>