Web scraping with Python: problem with BeautifulSoup

Date: 2019-06-08 12:06:57

Tags: beautifulsoup

Please help me use BeautifulSoup with Python 3 to fetch financial values from investing.com. No matter what I try I get no value back, and the class used for filtering keeps changing on the page because it is a live quote.

import requests

from bs4 import BeautifulSoup

url = "https://es.investing.com/indices/spain-35-futures"
precio_objetivo = input("Introduce el PRECIO del disparador:")
precio_objetivo = float(precio_objetivo)
print (precio_objetivo)

while True:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    precio_actual = soup.find('span', attrs={'class': 'arial_26 inlineblock pid-8828-last', 'id': 'last_last', 'dir': 'ltr'})
    print(precio_actual)
    break

When I don't apply any filter to soup.find (just trying to get the whole page), I get this:

<bound method Tag.find_all of 
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html>
<head>
<title>403 You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.                                </title>
</head>
<body>
<h1>Error 403 You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.</h1>
<p>You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.</p>
<h3>Guru Meditation:</h3>
<p>XID: 850285196</p>
<hr/>
<p>Varnish cache server</p>
</body>
</html>
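As an aside, the pasted output begins with `<bound method Tag.find_all of ...>`, which suggests `soup.find_all` was printed without parentheses, so Python displayed the bound method object rather than calling it. A minimal, stdlib-only illustration of the same mistake (using `str.upper` as a stand-in):

```python
# Printing a method without calling it shows the bound method object,
# not the result -- the same thing that happened with soup.find_all.
text = "ibex"

bound = text.upper     # no parentheses: a bound method object
called = text.upper()  # with parentheses: the actual result

print(repr(bound))     # something like <built-in method upper of str object at 0x...>
print(called)          # IBEX
```

That said, the `403 You are banned` body shows the real blocker here is the server rejecting the request, which the answers below address.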

3 Answers:

Answer 0 (score: 0)

The site appears to detect where the request comes from, so we need to "trick" it into thinking we are using a browser.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

r = Request("https://es.investing.com/indices/spain-35-futures", headers={"User-Agent": "Mozilla/5.0"})
c = urlopen(r).read()
soup = BeautifulSoup(c, "html.parser")
print(soup)

Answer 1 (score: 0)

The web server detects the Python script as a bot and blocks it. By sending a browser-like User-Agent header you can get around this; the following code does that:

import requests
from bs4 import BeautifulSoup

url = "https://es.investing.com/indices/spain-35-futures"

header={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page=requests.get(url,headers=header)

soup=BeautifulSoup(page.content,'html.parser')
#this soup returns <span class="arial_26 inlineblock pid-8828-last" dir="ltr" id="last_last">9.182,5</span>

result = soup.find('span',attrs={'id':'last_last'}).get_text()
#use the get_text() function to extract the text

print(result)
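Note that the extracted text uses the Spanish number format ("9.182,5", with a dot as the thousands separator and a comma as the decimal mark), so `float()` will not parse it directly. A small helper can normalize it first (the name `parse_es_price` is my own, not from the answer, and it assumes this exact locale format):

```python
def parse_es_price(text):
    """Convert a Spanish-formatted price like '9.182,5' to a float.

    Assumes '.' is the thousands separator and ',' the decimal mark.
    """
    return float(text.replace('.', '').replace(',', '.'))

print(parse_es_price('9.182,5'))        # 9182.5
print(parse_es_price('1.234.567,89'))   # 1234567.89
```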

Answer 2 (score: 0)

You can try the Selenium web driver. Otherwise, if you make a high volume of requests, you will run into this blocking more often. Sites that render their content with JavaScript can also be a problem for plain `requests`, and Selenium handles those too.

from selenium import webdriver

url = 'https://example.com/'
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options=options, executable_path='/usr/local/bin/chromedriver')
driver.get(url)
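Coming back to the original goal (a trigger price entered by the user), whichever scraping route you choose, the comparison logic itself is simple. A minimal sketch, with a hypothetical helper name of my own choosing and assuming the trigger fires when the live price is at or above the target (the question never specifies the direction):

```python
def trigger_hit(precio_actual, precio_objetivo):
    """Return True when the live price has reached or crossed the trigger.

    Hypothetical helper: assumes the trigger fires 'at or above' the target.
    """
    return precio_actual >= precio_objetivo

# Inside the polling loop, parse the scraped span text into a float
# and check it on each iteration before sleeping and retrying.
print(trigger_hit(9182.5, 9100.0))  # True
print(trigger_hit(9182.5, 9200.0))  # False
```

In a real polling loop you would also want a `time.sleep()` between requests to avoid hammering the site, which is likely part of what got the original script banned.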