Unable to parse div tags using Beautiful Soup in Python?

Time: 2018-07-09 10:53:04

Tags: python html web-scraping beautifulsoup

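I am learning to use Beautiful Soup to parse div containers in HTML. For some reason, when I pass the div's class name to Beautiful Soup, nothing happens: I get no content back when I try to parse the div. What might I be doing wrong? Here is my HTML and my parser: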

                                  <div class="upcoming-date ng-hide no-league" ng-show="nav.upcoming" ng-class="{'no-league': !search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,&quot;FOOTBALL - HIGHLIGHTS&quot;)}">
                        <span class="weekday">Monday</span>
                        <timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="date ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
                        <div class="clear"></div>
                    </div>





            <div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS" itemscope="" itemtype="https://schema.org/SportsEvent">
    <div class="leaguename ng-hide" ng-show="search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,&quot;FOOTBALL - HIGHLIGHTS&quot;) &amp;&amp; (1 || (nav.upcoming &amp;&amp; 0))">
      <span class="name">
        <span class="flag-icon flag-icon-swe"></span>
          Sweden - Allsvenskan
      </span>
    </div>

    <ul class="meta">
        <li class="date">
            <timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
        </li>
        <li class="time">
            <timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="false" show-time="true" class="ng-isolate-scope"><span class="ng-binding">20:00</span></timecomponent>
        </li>
        <li class="game-id"><span class="gameid">GameID:</span> 2087</li>
    </ul>
    <ul class="teams">
    <li>Hammarby</li>
    <li>Ostersunds</li>
</ul>
    <ul class="bet-selector">
        <li class="pick01" id="b499795664">
    <a data-id="499795664" ng-click="bets.pick($event, 499795664, 2087, 2.10)" class="betting-button pick-button " title="Hammarby">

                                        <span class="team">Hammarby</span>
                <span class="odd">2.10</span>
    </a>
</li>                    <li class="pick0X" id="b499795666">
    <a data-id="499795666" ng-click="bets.pick($event, 499795666, 2087, 3.56)" class="betting-button pick-button " title="Draw">
                                                <span class="team">Draw</span>
                <span class="odd">3.56</span>
    </a>
</li>                <li class="pick02" id="b499795668">
    <a data-id="499795668" ng-click="bets.pick($event, 499795668, 2087, 3.40)" class="betting-button pick-button " title="Ostersunds">

                                        <span class="team">Ostersunds</span>
                <span class="odd">3.40</span>
    </a>
</li>    </ul>
            <ul class="extra-picks">
                        <li>
                <a class="betting-button " href="/games/1390856/markets?league=0&amp;top=0&amp;sid=2087&amp;sportId=1">
                    <span class="short-desc">+13</span>
                    <span class="long-desc">View 13 more markets</span>
                </a>
            </li>
        </ul>
        <div class="game-stats">
                <a href="https://s5.sir.sportradar.com/sportpesa/en/match/13414729" onclick="window.open(this.href, 'newwindow', 'width=1024, height=800, resizable=yes, scrollbars=yes'); return false;"><img class="img-responsive" src="/img/chart-icon.png?v2.2.25.2"></a>
    </div>
    <div class="clear"></div>
</div>

............................................... ..............

parser.py

import requests
import urllib2
from bs4 import BeautifulSoup as soup
udata = urllib2.urlopen('https://www.sportpesa.co.ke/?sportId=1')
htmlsource = udata.read()
ssoup = soup(htmlsource,'html.parser')
page_div = ssoup.findAll("div",{"class":"match football FOOTBALL - HIGHLIGHTS"})

print page_div

2 Answers:

Answer 0 (score: 0)

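"match football FOOTBALL - HIGHLIGHTS" is a dynamic class, added by JavaScript after the page loads, which is why you only get an empty list. Here is my code in Python 3: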

from bs4 import BeautifulSoup as bs4
import requests
request = requests.get('https://www.sportpesa.co.ke/?sportId=1')
soup = bs4(request.text, 'lxml')
print(soup)

After printing the soup, you will see that the class is not present in the page source you downloaded. Hope this helps.
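Rather than reading the whole printed soup by eye, you can list the class names that actually occur on the div tags in the downloaded source. This is only a minimal sketch using the standard library: `DivClassCollector`, `div_classes`, `static_html` and `rendered_html` are names made up for illustration, and the two HTML strings are hypothetical stand-ins for what the server returns and what the browser shows after JavaScript has run, not real responses from the site.

```python
from html.parser import HTMLParser


class DivClassCollector(HTMLParser):
    """Collect the class tokens of every <div> in an HTML string."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag != "div":
            return
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute is a whitespace-separated token list.
                self.classes.update(value.split())


def div_classes(html_text):
    """Return the set of class tokens used on div tags in html_text."""
    collector = DivClassCollector()
    collector.feed(html_text)
    return collector.classes


# Hypothetical stand-ins (assumptions, not real site output).
static_html = '<html><body><div class="container" ng-view></div></body></html>'
rendered_html = '<div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS"></div>'

print("match" in div_classes(static_html))    # class missing from the static source
print("match" in div_classes(rendered_html))  # class present once rendered
```

In practice you would pass `request.text` from the snippet above to `div_classes`. If "match" is not in the result, no selector will find those divs in that response.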

Answer 1 (score: 0)

So, as suggested in the comments, the best (fastest) way to get data from this site is to use the same endpoint that the JavaScript uses.

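If you are using Chrome, open the inspector tool, switch to the Network tab, and load the page. You will see that the site fetches its data from a URL that is very similar to the one actually shown in the address bar: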
  

https://sportpesa.co.ke/sportgames?sportId=1

This endpoint gives you the data you need. Fetching it with requests and getting the divs looks like this:

import requests 
from bs4 import BeautifulSoup

r = requests.get("https://sportpesa.co.ke/sportgames?sportId=1")
soup = BeautifulSoup(r.text, "html.parser") 
page_divs = soup.select('div.match.football.FOOTBALL.-.HIGHLIGHTS')
print(len(page_divs))

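This prints 30, which is the number of divs. By the way, I am using the bs4 `select` method here, which is the way bs4 recommends handling this when, as here, you have multiple classes ("match", "football", "FOOTBALL", "-" and "HIGHLIGHTS").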
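One caveat, stated as an assumption rather than something tested against the live site: newer bs4 releases hand `select` to the soupsieve package, which may reject `.-` because a bare hyphen is not a valid CSS identifier. The sketch below avoids the problem by leaving the "-" token out of the selector, and also shows the exact-string form of `find_all` for comparison. `sample_html` is a trimmed, hand-copied fragment of the markup from the question, not a response from the endpoint.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the HTML in the question (an assumption about
# what the endpoint returns; the real response may differ).
sample_html = """
<div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS">
  <ul class="teams"><li>Hammarby</li><li>Ostersunds</li></ul>
  <span class="odd">2.10</span>
  <span class="odd">3.56</span>
  <span class="odd">3.40</span>
</div>
<div class="clear"></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Token-based match: each .name must be one of the element's class tokens.
# The "-" token is deliberately left out of the selector.
matches = soup.select("div.match.football.FOOTBALL.HIGHLIGHTS")

# Exact-string match: compares against the full class attribute value.
exact = soup.find_all("div", class_="match football FOOTBALL - HIGHLIGHTS")

# Pull a few fields out of the first matching div.
first = matches[0]
teams = [li.get_text(strip=True) for li in first.select("ul.teams li")]
odds = [float(span.get_text(strip=True)) for span in first.select("span.odd")]

print(len(matches), len(exact), teams, odds)
```

The token-based selector keeps working if the site reorders the classes or adds one, while the exact-string form breaks on any change to the attribute value, so the former is usually the safer choice.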