Valid XPath queries generated by browser add-ons do not work on urllib2-fetched pages

Time: 2016-02-12 01:51:14

Tags: python html xpath lxml urllib2

I want to extract every instruction ID from this page:

import urllib2
import lxml.html as lh

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib2.urlopen(url)
content = response.read()
root = lh.fromstring(content)
# XPATH_ALL_INSTRUCTION_IDS is one of the expressions listed below
all_instruction_ids = root.xpath(XPATH_ALL_INSTRUCTION_IDS)

I have tried countless XPath expressions generated by Chrome's developer tools, Firebug, and other browser plugins:

XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/.'
#XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/text()'
XPATH_ALL_INSTRUCTION_IDS  = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a[contains(normalize-space(), "")]'
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a'
XPATH_ALL_INSTRUCTION_IDS = ".//*[@id='content']/div/div/div[2]/table/tbody/tr[2]/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS  = "//form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS  = "id('content')/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS  = "/html/body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]//a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/*/a"

However, none of them work when passed to the xpath() method of the element returned by lxml.html.fromstring().

2 answers:

Answer 0: (score: 1)

The XPath // operator does not require you to start from the top of the document.

XPATH_ALL_INSTRUCTION_IDS = '//font/a'

I suggest you take a look at an XPath cheatsheet.
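One likely reason the browser-generated paths fail (an assumption, since the raw page source is not shown here): browsers insert a tbody element into the DOM they build for a table, while the HTML that urllib2 fetches and lxml parses may not contain one. The sketch below uses a small made-up HTML snippet (SAMPLE_HTML and its IDs and links are hypothetical, not the real page) to show an absolute path through tbody matching nothing while the short //font/a path still finds the links.

```python
import lxml.html as lh

# Hypothetical markup standing in for the fetched page.
# Note that the table has no <tbody>, as is common in raw server output.
SAMPLE_HTML = """
<html><body><div id="content">
<table>
  <tr><td><font><a href="/links/a">SAMPLE-001</a></font></td><td>First title</td></tr>
  <tr><td><font><a href="/links/b">SAMPLE-002</a></font></td><td>Second title</td></tr>
</table>
</div></body></html>
"""

root = lh.fromstring(SAMPLE_HTML)

# Path in the style a browser tool would generate: it expects a <tbody> step.
browser_style = root.xpath('//*[@id="content"]/table/tbody/tr/td[1]/font/a')

# Path anchored with // at the elements of interest: no <tbody> needed.
relative_style = root.xpath('//font/a')
sample_ids = [a.text_content().strip() for a in relative_style]

print(len(browser_style))
print(sample_ids)
```

If this is what happens on the real page, dropping the tbody step from the generated expressions (or starting from // near the target elements, as above) should be enough.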

Answer 1: (score: 1)

I would find all links whose href contains reference.nsf/links:

//table//a[contains(@href, 'reference.nsf/links')]/text()

Works for me.
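A minimal sketch of how that expression behaves with lxml, run against a made-up snippet rather than the real page (LINKS_HTML, the IDs, and the href values are hypothetical). Because the expression ends in text(), xpath() returns strings rather than elements, and links whose href does not contain the substring are skipped.

```python
import lxml.html as lh

# Hypothetical markup: two instruction links plus one unrelated link.
LINKS_HTML = """
<html><body>
<table>
  <tr><td><font><a href="/apps10/reference.nsf/links/x1">SAMPLE-101</a></font></td></tr>
  <tr><td><font><a href="/apps10/reference.nsf/links/x2">SAMPLE-102</a></font></td></tr>
  <tr><td><a href="/help.html">Help</a></td></tr>
</table>
</body></html>
"""

root = lh.fromstring(LINKS_HTML)

# Select by what the link points to, not by where it sits in the layout.
raw_texts = root.xpath("//table//a[contains(@href, 'reference.nsf/links')]/text()")
instruction_ids = [text.strip() for text in raw_texts]

print(instruction_ids)
```

Matching on the href makes the query independent of the table layout, so it keeps working whether or not the parser sees a tbody element.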