Question

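I am trying to get some information from a web page (link below) using Requests in Python. However, the HTML I see in my browser does not seem to be there when I connect through Python's requests library, and none of my XPath queries return anything. I can use requests on other websites, such as Amazon (the site below is actually owned by Amazon, yet I can't seem to get any information from it).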

import requests
from lxml import html

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'
user_agent = {'User-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=user_agent)
tree = html.fromstring(page.text)
# The id value must be quoted; unquoted, XPath compares @id to a child element named ourPrice.
query = tree.xpath('//span[@id="ourPrice"]/text()')

Answer 1

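The element is generated with JavaScript, so it is not in the HTML that requests downloads. You can get the rendered source with selenium, running it headless by combining it with PhantomJS: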

url = 'http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673&sindex=0&discovery=search&ref=qd_men_sr_1_0'

from selenium import webdriver

# Note: webdriver.PhantomJS needs an older Selenium release; Selenium 4 dropped it.
browser = webdriver.PhantomJS()
browser.get(url)
_html = browser.page_source

from bs4 import BeautifulSoup

# Name the parser explicitly so BeautifulSoup does not have to guess one.
print(BeautifulSoup(_html, "html.parser").find("span", {"id": "ourPrice"}).text)
$50
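
To see why plain requests comes back without the product markup, it may help to look at the URL itself. The following is a minimal sketch using only the standard library, not part of the original answer: everything after the # is a URL fragment, which is not transmitted in the HTTP request, so the server only ever sees the site root and the page's JavaScript builds the product view from the fragment afterwards. The variable names parts and sent_to_server are just for illustration.

```python
from urllib.parse import urlsplit

url = ('http://www.myhabit.com/#page=d&dept=men&asin=B00R5TK3SS'
       '&cAsin=B00DNNZIIK&qid=aps-0QRWKNQG094M3PZKX5ST-1429238272673'
       '&sindex=0&discovery=search&ref=qd_men_sr_1_0')

parts = urlsplit(url)

# An HTTP client sends scheme, host, path and query; the fragment stays local.
sent_to_server = parts._replace(fragment='').geturl()

print('sent to server :', sent_to_server)
print('query string   :', repr(parts.query))
print('fragment (kept client-side):', parts.fragment)
```

So whatever headers are passed, requests.get(url) can only receive the generic page for the site root, which is why a browser (real or headless) is needed to produce the span with the price.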

Answer 2

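Here is the code I used to scrape a table from a site. On that site the table has no id or class defined, so you don't need to put anything in the predicate. If the table does have an id or class, just use html.xpath('//table[@id="id_val"]/tr') instead of html.xpath('//table/tr').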

from urllib.request import urlopen

from lxml import etree

web = urlopen("http://www.yourpage.com/")
html = etree.HTML(web.read())
tr_nodes = html.xpath('//table/tr')

# Keep rows whose third cell is 'Chennai' or 'Across India', or lists 'Chennai' among '/'-separated values.
td_content = []
for tr in tr_nodes:
    third_cell = tr.xpath('td')[2].text
    if third_cell in ('Chennai', 'Across India') or 'Chennai' in third_cell.split('/'):
        td_content.append(tr.xpath('td'))

main_list = []
for tds in td_content:
    sixth_cell = tds[5].text
    # Keep rows whose sixth cell mentions 'Freshers' or contains a standalone '0'.
    if sixth_cell == 'Freshers' or 'Freshers' in sixth_cell.split('/') or '0' in sixth_cell.split(' '):
        sub_list = [td.text for td in tds]
        # Insert the absolute link built from the href in the seventh cell.
        sub_list.insert(6, 'http://yourpage.com/%s' % tds[6].xpath('a')[0].get('href'))
        main_list.append(sub_list)
print('main_list', main_list)
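
For anyone who wants to try the id-versus-no-id selection without hitting a real site, here is a small self-contained sketch. It is an illustration under assumptions, not the code from the answer: the sample markup, the id "jobs", and the helper location_matches are all made up, and it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath and needs well-formed markup) in place of lxml so that it runs without extra packages. The helper also generalizes the check slightly by accepting any wanted value among '/'-separated parts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a real page.
SAMPLE = """
<html><body>
  <table id="jobs">
    <tr><td>Acme</td><td>Developer</td><td>Chennai/Pune</td></tr>
    <tr><td>Globex</td><td>Tester</td><td>Mumbai</td></tr>
    <tr><td>Initech</td><td>Analyst</td><td>Across India</td></tr>
  </table>
  <table>
    <tr><td>unrelated</td><td>row</td><td>Chennai</td></tr>
  </table>
</body></html>
""".strip()

WANTED = {'Chennai', 'Across India'}


def location_matches(text):
    """Return True if the cell, or any '/'-separated part of it, is wanted."""
    text = text or ''
    return text in WANTED or any(part in WANTED for part in text.split('/'))


root = ET.fromstring(SAMPLE)

# Without a predicate, rows from every table are selected.
all_rows = root.findall('.//table/tr')

# With an id predicate, only rows of that one table are selected.
rows = root.findall(".//table[@id='jobs']/tr")

matches = []
for row in rows:
    cells = [td.text for td in row.findall('td')]
    if location_matches(cells[2]):
        matches.append(cells)

print(len(all_rows), len(rows))
print(matches)
```

With lxml the same two selections would be written as html.xpath('//table/tr') and html.xpath('//table[@id="jobs"]/tr').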

Unable to scrape a web page using the Python Requests library

2 answers: