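How do you scrape an entire page with Selenium?

Time: 2020-07-29 15:52:11

Tags: python selenium-webdriver web-scraping

My goal is to read the content nested inside a large number of divs. The only problem is that the content seems to depend on JavaScript, so as far as I know I can't get it just by using driver.page_source.

Here is my code: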

import requests # for making standard html requests
from bs4 import BeautifulSoup # magical tool for parsing html data
import json # for parsing data
from pandas import DataFrame as df # premier library for data organization
import time
import lxml
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager


url = "https://www.challengermode.com/dota2/tournaments?state=upcoming"
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
time.sleep(5) # To let the page load in
soup_ID = BeautifulSoup(driver.page_source, 'html.parser')
print(soup_ID.prettify)

Here is an image of the span of information I want to be included in the print

Here is my output:

<bound method Tag.prettify of <html class="arena-html mod_flexbox mod_flexwrap mod_cssscrollbar mod_eventlistener mod_scriptasync mod_localstorage mod_sessionstorage mod_websockets mod_eventsource" id="html" lang="en" style="margin: 0px; padding: 0px;"><head>
<base href="/"/>
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,400i,500,700&amp;display=swap" rel="stylesheet"/>
<link as="style" href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/light.43d62e718e19239b66ac.css" rel="preload"/>
<link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/light.43d62e718e19239b66ac.css" media="all" onload="this.media='all'" rel="stylesheet"/>
<noscript><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/light.43d62e718e19239b66ac.css" rel="stylesheet"/></noscript>
<link as="style" href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/arena-paypal.26f2c9c2acd9b96ba93b.css" rel="preload"/>
<link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/arena-paypal.26f2c9c2acd9b96ba93b.css" media="all" onload="this.media='all'" rel="stylesheet"/>
<noscript><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/arena-paypal.26f2c9c2acd9b96ba93b.css" rel="stylesheet"/></noscript>
<script async="" src="https://widget.intercom.io/widget/yxk7m4ye" type="text/javascript"></script><script async="" src="https://www.google-analytics.com/gtm/js?id=GTM-MHVMG4G&amp;t=gtag_UA_63855440_1&amp;cid=2113228608.1596037460" type="text/javascript"></script><script async="" src="https://www.google-analytics.com/plugins/ua/linkid.js" type="text/javascript"></script><script async="" src="https://www.googleadservices.com/pagead/conversion_async.js" type="text/javascript"></script><script async="" src="https://connect.facebook.net/signals/config/1363905500304531?v=2.9.22&amp;r=stable"></script><script async="" crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js?hash=4c7217325ae946d41396c9d017814623&amp;ua=modern_es6"></script><script async="" src="https://www.googletagmanager.com/gtag/js?id=AW-969263990&amp;l=dataLayer&amp;cx=c" type="text/javascript"></script><script async="" src="https://www.google-analytics.com/analytics.js" type="text/javascript"></script><script async="" src="https://www.gstatic.com/recaptcha/releases/AFBwIe6h0oOL7MOVu88LHld-/recaptcha__en.js" type="text/javascript"></script><script id="facebook-jssdk" src="//connect.facebook.net/en_US/sdk.js"></script><script async="" src="https://connect.facebook.net/en_US/fbevents.js"></script><script async="true" crossorigin="anonymous" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/manifest.2aa5da30056e9cc4eae7.bundle.js"></script>
<title>Dota 2 Tournaments | Challengermode</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=750, user-scalable=no" name="viewport"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<link href="/pwa-manifest.json" rel="manifest"/>
<link href="/opensearch" rel="search" title="Challengermode" type="application/opensearchdescription+xml"/>
<meta content="#252730" name="theme-color"/>
<meta content="#252730" name="msapplication-navbutton-color"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black-translucent" name="apple-mobile-web-app-status-bar-style"/>
<meta content="Challengermode" name="apple-mobile-web-app-title"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/cm-192-logo.png" rel="apple-touch-icon" sizes="192x192"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/cm-512-logo.png" rel="apple-touch-icon" sizes="512x512"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphone6_splash.png" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphonex_splash.png" media="(device-width: 375px) and (device-height: 812px) and (-webkit-device-pixel-ratio: 3)" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphone6_splash.png" media="(device-width: 375px) and (device-height: 667px) and (-webkit-device-pixel-ratio: 2)" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphoneplus_splash.png" media="(device-width: 414px) and (device-height: 736px) and (-webkit-device-pixel-ratio: 3)" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphone5_splash.png" media="(device-width: 320px) and (device-height: 568px) and (-webkit-device-pixel-ratio: 2)" rel="apple-touch-startup-image"/>
<link href="https://www.challengermode.com/tournaments/feed" rel="alternate" type="application/atom+xml"/>
<link href="https://www.challengermode.com/spaces/feed" rel="alternate" type="application/atom+xml"/>
<link href="https://www.challengermode.com/classifieds/feed" rel="alternate" type="application/atom+xml"/>
<meta content="Leading platform for Dota 2 esports tournaments. Compete in quality tournaments from the best organizers or create your own space &amp; monetize your community." name="description"/>
<meta content="challengermode esports competitions tournaments leagues skills solo team organize host
lol league of legends csgo counter-strike: global offensive pubg playerunknowns battlegrounds dota 2 teamfight tactics tft valorant" name="keywords"/>
<meta content="index,follow" name="robots"/>
<meta content="English" name="language"/>
<link href="https://www.challengermode.com/dota2/tournaments?state=upcoming" rel="canonical"/>
<link href="https://api.challengermode.com" rel="dns-prefetch"/>
<link crossorigin="" href="https://api.challengermode.com" rel="preconnect"/>
<link href="https://syndication.twitter.com" rel="preconnect"/>
<link href="https://widget.intercom.io" rel="preconnect"/>
<link href="https://js.intercomcdn.com" rel="preconnect"/>
<link href="https://www.facebook.com" rel="preconnect"/>
<link crossorigin="" href="https://connect.facebook.net" rel="preconnect"/>
<link href="https://api-iam.intercom.io" rel="preconnect"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link href="https://az416426.vo.msecnd.net" rel="preconnect"/>
<link href="https://stats.g.doubleclick.net" rel="preconnect"/>
<link crossorigin="" href="https://fonts.googleapis.com" rel="preconnect"/>
<link href="https://dc.services.visualstudio.com" rel="preconnect"/>
<meta content="https://www.challengermode.com/dota2/tournaments?state=upcoming" property="og:url"/>
<meta content="Dota 2 Tournaments" property="og:title"/>
<meta content="Leading platform for Dota 2 esports tournaments. Compete in quality tournaments from the best organizers or create your own space &amp; monetize your community." property="og:description"/>
<meta content="https://challengermode-permanent-assets.azureedge.net/app/og_image.png" property="og:image"/>
<meta content="image/png" property="og:image:type"/>
<meta content="1200" property="og:image:width"/>
<meta content="630" property="og:image:height"/>
<meta content="website" property="og:type"/>
<meta content="Challengermode" property="og:site_name"/>
<meta content="cm:game_info_slug:f52a42ce-3425-4dca-ab1d-e425ea1e71ea" property="og:cm_resource"/>
<meta content="3625f24494c7ac4f0ad3" name="wot-verification"/>
<meta content="1179483245396310" property="fb:app_id"/>
<style>

    body::after {
        content: "none";
        display: none !important
    }

    @media (max-width:1920px) {
        body::after {
            content: "breakpoint--full-hd"
        }
    }

    @media (max-width:1280px) {
        body::after {
            content: "breakpoint--hd"
        }
    }

    @media (max-width:1024px) {
        body::after {
            content: "breakpoint--tablet"
        }
    }

    @media (max-width:414px) {
        body::after {
            content: "breakpoint--mobile"
        }
    }
</style>
<script src="//az416426.vo.msecnd.net/scripts/a/ai.0.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/0.1d1eb0a321bfe9aa47ee.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/1.97217bf357c5de4a751a.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/2.3240916b8c45c6c77a5b.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/3.966cc108df5a7515bf50.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/7.ed08c498b552166708b9.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/175.f6ae048c521d527a8f53.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/282.a0ab5b4c130061ae89b3.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/323.86bf89e818dd1c06cf21.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/337.c989accb4d8622d946e5.bundle.js"></script><style data-emotion=""></style><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/5.da829e90054bb31c6591.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/4.fc75798185acc24a996a.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/6.ba3b4ef40d494de88ed8.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/8.0a8441153a17e1c20931.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/9.92e08e43b5aeab83b11a.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/11.75d6926838e4e7c55f20.bundle.js"></script><script charset="utf-8" 
src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/17.0c42d6a55e624fc36e4c.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/51.79196085aeb507e3486e.bundle.js"></script><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/10.5df7cf3cfa886d3230a3.css" rel="stylesheet" type="text/css"/><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/10.8c3b8aef15bdf341e192.bundle.js"></script><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/13.8ddd5b6f8bfee769c14a.css" rel="stylesheet" type="text/css"/><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/13.d20ed356ddb838ab76ce.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/16.9da04cea0e07cef002f4.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/22.13bf9d744401ea38a0bd.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/30.803fc5a3967c13785bb5.bundle.js"></script><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/71.ab772642f9c8624e736d.css" rel="stylesheet" type="text/css"/><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/71.e7da16d37e16b62bf79b.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/158.818c18197b42c18410d9.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/284.8b5c95597f8814f01390.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/12.404ebfb3d2a9e09d5abc.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/75.9fddbb16d492adbd2ab5.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/161.1c91a2c2545bb21d7e20.bundle.js"></script><script 
src="https://googleads.g.doubleclick.net/pagead/viewthroughconversion/969263990/?random=1596037459961&amp;cv=9&amp;fst=1596037459961&amp;num=1&amp;bg=ffffff&amp;guid=ON&amp;resp=GooglemKTybQhCsO&amp;u_h=1080&amp;u_w=1920&amp;u_ah=1080&amp;u_aw=1920&amp;u_cd=24&amp;u_his=2&amp;u_tz=120&amp;u_java=false&amp;u_nplug=3&amp;u_nmime=4&amp;gtm=2oa7m1&amp;sendb=1&amp;ig=1&amp;data=event%3Dpage_view&amp;frm=0&amp;url=https%3A%2F%2Fwww.challengermode.com%2Fdota2%2Ftournaments%3Fstate%3Dupcoming&amp;tiba=Dota%202%20Tournaments%20%7C%20Challengermode&amp;hn=www.googleadservices.com&amp;async=1&amp;rfmt=3&amp;fmt=4"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/14.eb76c66c32e99864e5ad.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/15.1379135acdc99c059dcd.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedg

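The desired output would show all the source code marked in blue and red in the image.

Oh, and if you have any questions or need more information, I'm happy to provide it.

2 Answers:

Answer 0: (score: 0)

I found a workaround. It seems I can't print everything, but it is still stored. So if I use driver.find_element_by_class_name("link-white"), I get exactly what I was after.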
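
A side note on the printed output in the question: it starts with "<bound method Tag.prettify of", which is what Python prints when a method is referenced without parentheses instead of being called. The sketch below reproduces that behaviour with a small stand-in class. FakeSoup is hypothetical and only mimics the shape of the call; it is not the bs4 API.

```python
class FakeSoup:
    """Hypothetical stand-in for a parsed document (not the bs4 API)."""

    def __init__(self, html):
        self.html = html

    def prettify(self):
        # Return the stored markup; a real parser would also re-indent it.
        return self.html

    def __repr__(self):
        # The repr is the markup itself, which is why the bound-method
        # message ends up containing the whole document.
        return self.html


soup = FakeSoup("<html><body>tournaments</body></html>")

# Without parentheses, the method object itself is printed:
# the text starts with "<bound method FakeSoup.prettify of".
print(soup.prettify)

# With parentheses, the method runs and its return value is printed.
print(soup.prettify())
```

In the question's code that would mean writing print(soup_ID.prettify()) rather than print(soup_ID.prettify).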

Answer 1: (score: -1)

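# `driver` is the Selenium WebDriver created in the question's code.
# The [@class="..."] predicate must be closed with "]" before stepping down to the nested spans.
dates = driver.find_elements_by_xpath('//span[@class="f--medium f--small--mobile fw--bold c--white-dark tt--u lh--1em ellipsis dis--blk"]/span/span')
for a in dates:
    print(a.text)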

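find_elements_by_xpath finds every element in the page that matches the expression and returns them as a list of elements. Here the dates are nested in span > span > span.

This is an XPath selector, though you could do the same thing in other ways (for example by CSS selector or ID).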

  • // searches the entire HTML document
  • span[@class="xx"] selects the span whose class attribute is "xx"
  • /span/span: each /span steps down to a child span of the current element, so here we go two levels down.
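
The three pieces above can be tried without a browser. This is a minimal sketch using the standard library's xml.etree.ElementTree, which supports a subset of XPath. The sample markup, the dates in it, and the class value "xx" are made up for illustration, and a relative .// prefix is used because ElementTree searches from an element rather than from the document.

```python
import xml.etree.ElementTree as ET

# Made-up markup: two matching spans, one with a different class,
# and one that matches the class but is not nested deeply enough.
SAMPLE = """
<div>
  <span class="xx"><span><span>Jul 30</span></span></span>
  <span class="xx"><span><span>Aug 02</span></span></span>
  <span class="other"><span><span>ignored</span></span></span>
  <span class="xx"><span>too shallow</span></span>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# .//span[@class="xx"]  -> any span below the root whose class is exactly "xx"
# /span/span            -> then a child span, then a child span of that
matches = root.findall('.//span[@class="xx"]/span/span')
dates = [element.text for element in matches]

for date in dates:
    print(date)
```

Note that [@class="xx"] compares the whole attribute value, so the full class string has to match exactly, which is why the selector in the answer spells out every class name.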

Then I wrote a for loop to print the text of every date on the page.
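
In current Selenium 4 releases the find_elements_by_* helpers are no longer available; the equivalent call is find_elements(By.XPATH, ...), and an explicit wait can replace a fixed time.sleep. Here is a hedged sketch of the same loop in that style. The function name wait_for_dates and the 10-second timeout are assumptions, and the class string is copied from the selector above and may no longer match the live site.

```python
# XPath from the answer, with the predicate closed by "]".
DATE_XPATH = (
    '//span[@class="f--medium f--small--mobile fw--bold c--white-dark '
    'tt--u lh--1em ellipsis dis--blk"]/span/span'
)


def wait_for_dates(driver, timeout=10):
    """Wait until the date spans exist, then return their text.

    Raises selenium's TimeoutException if nothing matches in time.
    """
    # Imported here so that defining the function does not require Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Poll the page until at least one matching element is present.
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.XPATH, DATE_XPATH))
    )
    return [element.text for element in elements]
```

With the driver from the question, after driver.get(url), this would be used as: for text in wait_for_dates(driver): print(text).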