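I am trying to get some information from a web page (link below) using Requests in Python; however, the HTML I see in my browser does not seem to be present when I connect through the requests library. None of my XPath queries return anything. requests works for me on other sites, such as Amazon (the site below is actually owned by Amazon, yet I cannot seem to get any information from it).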
import requests
from lxml import html

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'
user_agent = {'User-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=user_agent)
tree = html.fromstring(page.text)
# The id value must be quoted inside the XPath expression
query = tree.xpath("//span[@id='ourPrice']/text()")
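Answer 0 (score: 3)
The element is generated by JavaScript, so it is not in the HTML that requests receives. You can use selenium to get the rendered source, combining it with PhantomJS for headless browsing: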
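from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'
# PhantomJS must be installed and on the PATH; webdriver.PhantomJS is
# deprecated in newer Selenium releases
browser = webdriver.PhantomJS()
browser.get(url)  # loads the page and runs its JavaScript
_html = browser.page_source
browser.quit()
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
print(BeautifulSoup(_html, "html.parser").find("span", {"id": "ourPrice"}).text)
$50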
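One way to see why the plain request comes back without the price: everything after the `#` in the URL is a fragment, which HTTP clients do not send to the server, so the server is only asked for the bare page and the rest is left to the page's JavaScript. The sketch below just splits the URL from the question with the standard library; the variable names `requested` and `client_side` are mine, chosen for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = ('http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS'
       '&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673'
       '&sindex=0&discovery=search&ref=qd_men_sr_1_0')

parts = urlsplit(url)

# What the server is actually asked for: the path plus the query string
# (the query string is empty for this URL).
requested = parts.path + ('?' + parts.query if parts.query else '')

# What stays on the client for the page's JavaScript to interpret.
client_side = parse_qs(parts.fragment)

print(requested)            # the only part a plain HTTP request carries
print(client_side['asin'])  # product id that only the browser-side code sees
```

Since the product details are selected entirely by the fragment, a tool that executes the page's JavaScript (as selenium does above) is needed to obtain them.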
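Answer 1 (score: 0)
Here is the code I used to scrape a table from a site. On that site the table has no id or class defined, so you do not need to put anything in the query. If the table does have an id or class, just use html.xpath('//table[@id="id_val"]/tr') instead of html.xpath('//table/tr').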
from lxml import etree
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen

web = urlopen("http://www.yourpage.com/")
html = etree.HTML(web.read())
tr_nodes = html.xpath('//table/tr')

# Keep rows whose third cell is 'Across India' or lists 'Chennai'
td_content = []
for tr in tr_nodes:
    tds = tr.xpath('td')
    if len(tds) < 7:  # skip header rows and rows too short to index below
        continue
    col2 = tds[2].text or ''
    if col2 == 'Across India' or 'Chennai' in col2.split('/'):
        td_content.append(tds)

main_list = []
for i in td_content:
    col5 = i[5].text or ''
    if 'Freshers' in col5.split('/') or '0' in col5.split(' '):
        sub_list = [td.text for td in i]
        # Insert the absolute link built from the anchor in the seventh cell
        sub_list.insert(6, 'http://yourpage.com/%s' % i[6].xpath('a')[0].get('href'))
        main_list.append(sub_list)
print('main_list', main_list)
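To make the difference between '//table/tr' and the id-restricted query concrete, here is a minimal self-contained sketch. It is not the code above: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath and needs well-formed markup) so that it runs without a network connection or lxml. The sample table, its id `jobs`, and the helper `rows_matching` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real site).
SAMPLE = """
<html><body>
  <table id="jobs">
    <tr><th>Company</th><th>Role</th><th>Location</th></tr>
    <tr><td>Acme</td><td>Tester</td><td>Chennai</td></tr>
    <tr><td>Globex</td><td>Developer</td><td>Mumbai/Chennai</td></tr>
    <tr><td>Initech</td><td>Analyst</td><td>Delhi</td></tr>
  </table>
  <table>
    <tr><td>x</td><td>y</td><td>Chennai</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)


def rows_matching(rows, column, wanted):
    """Return the cell texts of rows whose `column` cell lists `wanted`
    as one of its '/'-separated items."""
    matches = []
    for tr in rows:
        cells = [(td.text or '') for td in tr.findall('td')]
        # Header rows contain <th> cells only, so `cells` is empty for them.
        if len(cells) > column and wanted in cells[column].split('/'):
            matches.append(cells)
    return matches


# Every row of every table on the page.
all_rows = root.findall('.//table/tr')
# Only the rows of the table whose id attribute is "jobs".
jobs_rows = root.findall(".//table[@id='jobs']/tr")

all_hits = rows_matching(all_rows, 2, 'Chennai')
jobs_hits = rows_matching(jobs_rows, 2, 'Chennai')

print(len(all_hits), len(jobs_hits))
```

With the unrestricted query the row from the second, unrelated table is picked up as well, which is why filtering on the id is worth doing whenever the table has one. Real-world HTML is often not well-formed XML, in which case lxml's parser, as used in the answer, is the more practical choice.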