I have been trying to log in to this website (the login page at the URL in the code below) using Selenium WebDriver for Python.
To do that, I ran the following in Python:
from selenium import webdriver
import bs4 as bs
driver = webdriver.Chrome()
driver.get('https://app.chatra.io/')
Then I went on to parse the page with Beautiful Soup:
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify)
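The main problem is that the page never fully loads. When I load the page in a browser myself, everything is fine. However, when Selenium WebDriver tries to load it, it seems to stop halfway.
Any idea why? Any suggestions on how to fix it, or where to read up on this?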
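Answer 0 (score: 1)
First of all, the problem is also reproducible for me in the latest Chrome (with chromedriver 2.34, which is also the latest version at the time of writing). I am not sure yet what is going on there. Workaround: Firefox worked perfectly for me.
I would also add an extra step between driver.get() and the HTML parsing: an explicit wait, which lets the page load until the desired condition is true: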
import bs4 as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('https://app.chatra.io/')
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "signin-email")))
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify())
Note that you also need to call prettify() with parentheses, because it is a method.
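To illustrate what an explicit wait does conceptually, here is a minimal sketch of a polling loop in plain Python. It is not Selenium's implementation. The helper wait_until, its parameters, and the stand-in condition element_is_visible are hypothetical names introduced only for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose element appears on the third check.
checks = {"count": 0}


def element_is_visible():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_until(element_is_visible, timeout=2.0, poll_interval=0.01)
print(found, checks["count"])
```

The point of the pattern is that parsing only starts once the condition holds, instead of immediately after driver.get() returns.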
Answer 1 (score: 1)
The problem you are facing has several aspects:
When you try to use BeautifulSoup with urlopen from urllib.request, the error says it all:
urllib.error.HTTPError: HTTP Error 403: Forbidden
This means that urllib.request is detected and HTTP Error 403: Forbidden is raised. Hence using webdriver from selenium makes sense.
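As a rough sketch of how such a 403 surfaces in urllib.request, the snippet below starts a local stand-in server that rejects every request, then catches the resulting urllib.error.HTTPError. The server, the handler class ForbiddenHandler, and the helper fetch_status are assumptions made for illustration. They do not reproduce how the real site decides to reject a client.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ForbiddenHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server that answers every GET with 403."""

    def do_GET(self):
        self.send_response(403)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the console quiet


def fetch_status(url):
    """Return the HTTP status code, whether or not the request raises."""
    # An empty ProxyHandler makes sure the local request bypasses any proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status = fetch_status("http://127.0.0.1:%d/" % server.server_port)
server.shutdown()
server.server_close()
print(status)
```

Catching HTTPError and reading its code attribute is how a script can tell a rejected request apart from a network failure.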
Next, when you use ChromeDriver with Chrome, the website initially opens and renders. But soon the WebDriver is detected, and ChromeDriver is unable to parse the <head> and <body> tags. All you see is the minimal markup:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" class="supports cssfilters flexwrap chrome webkit win hover web"></html>
Finally, when you use GeckoDriver with Firefox Quantum, the website opens and renders properly, as follows:
Code block:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
html = driver.execute_script('return document.documentElement.outerHTML')
pagesoup = soup(html, "html.parser")
print(pagesoup)
Console output:
<html class="supports cssfilters flexwrap firefox gecko win hover web"><head>
<link class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51" rel="stylesheet" type="text/css"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
.
.
.
<em>··· Chatra</em>
.
.
.
</div></body></html>
Adding prettify to the soup extraction:
Code block:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
html = driver.execute_script('return document.documentElement.outerHTML')
pagesoup = soup(html, "html.parser")
print(pagesoup.prettify)
Console output:
<bound method Tag.prettify of <html class="supports cssfilters flexwrap firefox gecko win hover web"><head>
<link class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51" rel="stylesheet" type="text/css"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
.
.
.
<em>··· Chatra</em>
.
.
.
</div></body></html>>
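The output above begins with "<bound method Tag.prettify of" because print(pagesoup.prettify) prints the method object itself rather than the string it returns. A minimal sketch of the difference follows. The Page class is a hypothetical stand-in used only for illustration, not BeautifulSoup itself.

```python
class Page:
    """Hypothetical stand-in for a parsed document with a prettify() method."""

    def __init__(self, markup):
        self.markup = markup

    def prettify(self):
        # Put each tag on its own line (a crude illustration only).
        return self.markup.replace("><", ">\n<")


page = Page("<html><body></body></html>")

as_attribute = repr(page.prettify)  # the bound method object itself
as_call = page.prettify()           # the string the method returns

print(as_attribute)
print(as_call)
```

With the parentheses added, the formatted markup is printed instead of the method's repr.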
You can even use Selenium's page_source attribute, as follows:
Code block:
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
print(driver.page_source)
Console output:
<html class="supports cssfilters flexwrap firefox gecko win hover web">
<head>
<link rel="stylesheet" type="text/css" class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover">
<!-- platform specific stuff -->
<meta name="msapplication-tap-highlight" content="no">
<meta name="apple-mobile-web-app-capable" content="yes">
<!-- favicon -->
<link rel="shortcut icon" href="/static/favicon.ico">
<!-- win8 tile -->
<meta name="msapplication-TileImage" content="/static/win-tile.png">
<meta name="msapplication-TileColor" content="#ffffff">
<meta name="application-name" content="Chatra">
<!-- apple touch icon -->
<!--<link rel="apple-touch-icon" sizes="256x256" href="/static/?????.png">-->
<title>··· Chatra</title>
<style>
body {
background: #f6f5f7
}
</style>
<style type="text/css"></style>
</head>
<body>
<script async="" src="https://www.google-analytics.com/analytics.js"></script>
<script type="text/javascript" src="/meteor_runtime_config.js"></script>
<script type="text/javascript" src="https://app.chatra.io/9153feecdc706adbf2c71253473a6aa62c803e45.js?meteor_js_resource=true&_g_app_v_=51"></script>
<div class="body body-layout">
<div class="body-layout__main main-layout">
<aside class="main-layout__left-sidebar">
<div class="left-sidebar-layout">
</div>
</aside>
<div class="main-layout__content">
<div class="content-layout">
<main class="content-layout__main is-no-fades js-popover-boundry js-main">
<div class="center loading loading--light">
<div class="content-padding nothing">
<em>··· Chatra</em>
</div>
</div>
</main>
</div>
</div>
</div>
</div>
</body>
</html>
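If only a small piece of the returned markup is needed, the string can also be inspected with the standard library alone. The sketch below pulls the <title> text out of a shortened copy of the output above. TitleParser and extract_title are hypothetical helpers written for this illustration, and the sample string stands in for driver.page_source.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Shortened stand-in for the string returned by driver.page_source.
sample = (
    '<html><head><meta charset="utf-8">'
    "<title>··· Chatra</title></head><body></body></html>"
)
page_title = extract_title(sample)
print(page_title)
```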