import requests
from lxml import html
page = requests.get(url="http://www.cia.gov/library/publications/the-world-factbook/geos/ch.html")
tree = html.fromstring(page.content)
bordering = tree.xpath('//*[@id="wfb_data"]/table/tr[4]/td/ul[3]/li[4]/div[17]/span[2]/text()')
print(bordering)
I copied the XPath from Chrome's developer tools, but it still gives me an empty `bordering` variable. I'm at a loss as to what could be going wrong.
Answer 0 (score: 3)
First, you need to use https instead of http:
https://www.cia.gov/library/publications/the-world-factbook/geos/ch.html
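If the URL comes from elsewhere in your code, one way to apply this fix is to rewrite the scheme before making the request. This is only a minimal sketch using the standard library; `force_https` is a hypothetical helper name, not part of requests or lxml.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return url with its scheme switched to https if it was http."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # SplitResult is a namedtuple, so _replace returns an updated copy.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


url = force_https("http://www.cia.gov/library/publications/the-world-factbook/geos/ch.html")
print(url)
```

The resulting `url` can then be passed to `requests.get` in place of the original address.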
Also, there is a simpler way to get the border data: find the span whose text starts with "border countries" and take the text of its next sibling:
bordering = tree.xpath('//*[@id="wfb_data"]//span[starts-with(., "border countries")]/following-sibling::span')[0]
print(bordering.text_content())
Prints:
Afghanistan 91 km, Bhutan 477 km, Burma 2,129 km, India 2,659 km, Kazakhstan 1,765 km, North Korea 1,352 km, Kyrgyzstan 1,063 km, Laos 475 km, Mongolia 4,630 km, Nepal 1,389 km, Pakistan 438 km, Russia (northeast) 4,133 km, Russia (northwest) 46 km, Tajikistan 477 km, Vietnam 1,297 km
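To see how the `following-sibling` lookup behaves without hitting the network, here is a small sketch that runs the same XPath against an inline HTML fragment. The `SAMPLE` markup and the `border_text` helper are assumptions made for illustration, and the real page layout may differ. The helper also returns `None` when nothing matches, rather than raising `IndexError` as a bare `[0]` would.

```python
from lxml import html

# Hypothetical fragment that mimics the layout the XPath expects;
# the real page markup may differ.
SAMPLE = """
<div id="wfb_data">
  <div>
    <span>total: </span><span>22,457 km</span>
  </div>
  <div>
    <span>border countries (2): </span><span>Afghanistan 91 km, Bhutan 477 km</span>
  </div>
</div>
"""


def border_text(tree):
    """Return the text of the span after the 'border countries' label, or None."""
    matches = tree.xpath(
        '//*[@id="wfb_data"]//span[starts-with(., "border countries")]'
        '/following-sibling::span'
    )
    if not matches:
        # An empty result usually means the page did not load as expected.
        return None
    return matches[0].text_content().strip()


tree = html.fromstring(SAMPLE)
result = border_text(tree)
print(result)
```

Anchoring on the label text rather than on positional indexes such as `tr[4]` or `div[17]` should make the expression less sensitive to small layout changes.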
Answer 1 (score: 0)
Try sending a User-Agent header with the request:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
# Note: verify=False disables TLS certificate verification.
page = requests.get(url, headers=headers, timeout=5, verify=False)
Let me know if that works.
Thanks.