Web scraping with Python: problem with BeautifulSoup

Date: 2019-06-08 12:06:57

Tags: beautifulsoup

Please help me use BeautifulSoup with Python 3 to fetch financial values from investing.com. No matter what I try I get no value back, and the class used for filtering keeps changing on the page because it is a live quote.

import requests

from bs4 import BeautifulSoup

url = "https://es.investing.com/indices/spain-35-futures"
precio_objetivo = input("Introduce el PRECIO del disparador:")
precio_objetivo = float(precio_objetivo)
print (precio_objetivo)

while True:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    precio_actual = soup.find('span', attrs={'class': 'arial_26 inlineblock pid-8828-last', 'id': 'last_last', 'dir': 'ltr'})
    print(precio_actual)
    break

When I don't apply any filter to soup.find (just trying to get the whole page), I get this:

<bound method Tag.find_all of 
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html>
<head>
<title>403 You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.                                </title>
</head>
<body>
<h1>Error 403 You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.</h1>
<p>You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.</p>
<h3>Guru Meditation:</h3>
<p>XID: 850285196</p>
<hr/>
<p>Varnish cache server</p>
</body>
</html>
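As an aside, the pasted output begins with `<bound method Tag.find_all of ...>`, which suggests `soup.find_all` was printed without parentheses, so Python displayed the bound method object rather than calling it. A minimal, stdlib-only illustration of the same mistake (using `str.upper` as a stand-in):

```python
# Printing a method without calling it shows the bound method object,
# not the result -- the same thing that happened with soup.find_all.
text = "ibex"

bound = text.upper     # no parentheses: a bound method object
called = text.upper()  # with parentheses: the actual result

print(repr(bound))     # something like <built-in method upper of str object at 0x...>
print(called)          # IBEX
```

That said, the `403 You are banned` body shows the real blocker here is the server rejecting the request, which the answers below address.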

3 Answers:

Answer 0 (score: 0)

The site appears to detect where the request comes from, so we need to "trick" it into thinking we are using a browser.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

r = Request("https://es.investing.com/indices/spain-35-futures", headers={"User-Agent": "Mozilla/5.0"})
c = urlopen(r).read()
soup = BeautifulSoup(c, "html.parser")
print(soup)

Answer 1 (score: 0)

The web server detects the Python script as a bot and blocks it. By sending a browser-like User-Agent header you can get around this; the following code does that:

import requests
from bs4 import BeautifulSoup

url = "https://es.investing.com/indices/spain-35-futures"

header={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page=requests.get(url,headers=header)

soup=BeautifulSoup(page.content,'html.parser')
#this soup returns <span class="arial_26 inlineblock pid-8828-last" dir="ltr" id="last_last">9.182,5</span>

result = soup.find('span',attrs={'id':'last_last'}).get_text()
#use the get_text() function to extract the text

print(result)
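Note that the extracted text uses the Spanish number format ("9.182,5", with a dot as the thousands separator and a comma as the decimal mark), so `float()` will not parse it directly. A small helper can normalize it first (the name `parse_es_price` is my own, not from the answer, and it assumes this exact locale format):

```python
def parse_es_price(text):
    """Convert a Spanish-formatted price like '9.182,5' to a float.

    Assumes '.' is the thousands separator and ',' the decimal mark.
    """
    return float(text.replace('.', '').replace(',', '.'))

print(parse_es_price('9.182,5'))        # 9182.5
print(parse_es_price('1.234.567,89'))   # 1234567.89
```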

Answer 2 (score: 0)

You can try the Selenium web driver. Otherwise, if you make a high volume of requests, you will run into this blocking more often. Sites that render their content with JavaScript can also be a problem for plain `requests`, and Selenium handles those too.

from selenium import webdriver

url = 'https://example.com/'
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options=options, executable_path='/usr/local/bin/chromedriver')
driver.get(url)
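Coming back to the original goal (a trigger price entered by the user), whichever scraping route you choose, the comparison logic itself is simple. A minimal sketch, with a hypothetical helper name of my own choosing and assuming the trigger fires when the live price is at or above the target (the question never specifies the direction):

```python
def trigger_hit(precio_actual, precio_objetivo):
    """Return True when the live price has reached or crossed the trigger.

    Hypothetical helper: assumes the trigger fires 'at or above' the target.
    """
    return precio_actual >= precio_objetivo

# Inside the polling loop, parse the scraped span text into a float
# and check it on each iteration before sleeping and retrying.
print(trigger_hit(9182.5, 9100.0))  # True
print(trigger_hit(9182.5, 9200.0))  # False
```

In a real polling loop you would also want a `time.sleep()` between requests to avoid hammering the site, which is likely part of what got the original script banned.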