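I'm trying to use XPath to access an element on the page at the following URL: http://www.booking.com/searchresults.html?dest_id=2400&dest_type=region&offset=288
The specific element I'm looking for is the div with the class "sr_item_link_to_villas". I've been using the following XPath to try to access it (in this example I'm trying to access the second listing, but the full script loops over every listing), but it returns an empty list: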
//*[@id="hotellist_inner"]/*[contains(@class,"sr_item")][2]//*[contains(@class,"sr_item_link_to_villas ")]
The full code is:
# Assumption: parse is lxml.html.parse (the original snippet omits the import)
from lxml.html import parse

url = 'http://www.booking.com/searchresults.html?dest_id=2400&dest_type=region&offset=288'
page = parse(url).getroot()
pathstr = '//*[@id="hotellist_inner"]/*[contains(@class,"sr_item")][2]//*[contains(@class,"sr_item_link_to_villas ")]'
content = page.xpath(pathstr)
Answer 0 (score: 0)
The following code may serve your purpose. You need to send header values with the request in order to get the data.
# Python 2 (urllib2)
import urllib2
from lxml import etree
from lxml.html import tostring, fromstring

def get_HTML(url):
    # Send browser-like headers along with the request
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0",
              "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
              "Connection": "keep-alive"}
    req = urllib2.Request(url, None, header)
    return urllib2.urlopen(req).read()

url = "http://www.booking.com/searchresults.html?dest_id=2400&dest_type=region&offset=288"
read = get_HTML(url)
tree = etree.HTML(read)
# Text of the links inside the target div (the class value ends with a space)
data = tree.xpath("//div[@class='sr_item_link_to_villas ']/a/text()")
print data
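For anyone on Python 3, here is a rough sketch of the same idea, assuming urllib.request in place of urllib2. The names build_request, get_html, villa_link_texts and the SAMPLE_HTML markup are made up for illustration and are not taken from the real page. It also swaps the exact @class comparison for a token-based test (concat plus normalize-space), which should keep matching if the class attribute gains extra classes or loses its trailing space. Whether the live page still uses this class name is not something this sketch can confirm.

```python
import urllib.request

from lxml import etree

# Browser-like headers, copied from the answer above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}

# Token-based class test: matches whether or not the attribute carries
# extra classes or stray whitespace.
VILLA_LINK_XPATH = (
    "//div[contains(concat(' ', normalize-space(@class), ' '),"
    " ' sr_item_link_to_villas ')]/a/text()"
)


def build_request(url):
    """Return a Request that carries the browser-like headers."""
    return urllib.request.Request(url, headers=HEADERS)


def get_html(url):
    """Fetch the page body as bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


def villa_link_texts(html):
    """Return the link texts found inside the target divs."""
    tree = etree.HTML(html)
    return [text.strip() for text in tree.xpath(VILLA_LINK_XPATH)]


# Hypothetical markup, only to exercise the XPath without a network call.
SAMPLE_HTML = b"""
<div id="hotellist_inner">
  <div class="sr_item">
    <div class="sr_item_link_to_villas "><a href="#">Villas in the area</a></div>
  </div>
  <div class="sr_item">
    <div class="other sr_item_link_to_villas"><a href="#">More villas</a></div>
  </div>
</div>
"""

if __name__ == "__main__":
    print(villa_link_texts(SAMPLE_HTML))
```

To run it against the real page, pass the result of get_html(url) to villa_link_texts instead of SAMPLE_HTML.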