Question

I am trying to scrape "span" tags using BeautifulSoup. Here is my code:

import urllib
from bs4 import BeautifulSoup
url="someurl"
res=urllib.urlopen(url)
html=res.read()
soup=BeautifulSoup(html,"html.parser")
soup.findAll("span")

But when I do this for certain web pages, it does not list all the spans. It only shows a limited number of them. Yet when I run

soup.prettify()

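the output contains all the spans. What could be the reason? Am I missing something? Some answers I found also suggest using a headless browser such as "htmlunit", but I am not sure what exactly those are. Can I integrate one into my Django project?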

Output of soup.prettify(): https://drive.google.com/file/d/0BxhTzDujWhPVTzdIS2VWd1pZcHM/view?usp=sharing

Expected output of soup.findAll("span"):

list of all the spans

Output I am getting:

[<span class="ssc-ftpl ssc_ga_tag" data-gaa="Opened" data-gac="Footer" data-gal="Responsible Gambling" tabindex="0"> Responsible Gambling</span>, <span class="ssc-ftpl ssc_ga_tag" data-gaa="Opened" data-gac="Footer" data-gal="About Betfair" tabindex="0"> About Betfair</span>, <span class="ssc-ftpl ssc-ftls " tabindex="0">English - UK</span>, <span class="ssc-ftpl" tabindex="0">\xa9 \xae</span>]

Answer 1

Maybe you are trying to scrape a different page, but I had no problem scraping that site. Here is my code:

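import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.betfair.com/sport/football'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)  # no parser named, so bs4 picks the best one installed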

test = soup.find_all('span')
for span in test:
    print(span)

This produces a long list of spans, including the lines/scores I think you are interested in:

<span class="ssc-lkh"></span>
<span>Join Now</span>
<span class="new flag-en"></span>
<span class="new flag-en"></span>
<span class="sportIcon-6423"></span>
<span class="sportName">American Football</span>
<span class="sportIcon-3988"></span>
<span class="sportName">Athletics</span>
<span class="sportIcon-61420"></span>
.....

Updated in response to the comments below

Here is some revised code showing that my approach does return the spans you need.

url='https://www.betfair.com/sport/football'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)

test = soup.find_all('span', attrs={"class": "away-team-name"})
for span in test:
    print("away team" + span.text)

Produces:

away team
Marseille

away team
Lazio

away team
Academica

away team
Canada (W)

away team
Arnett Gardens FC

away team
UWI FC
....
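The blank lines in the output above suggest that the text of each span is padded with newlines. As a minimal sketch, assuming bs4 is installed, get_text(strip=True) trims that padding. The markup below is a made-up sample that imitates such a span, not the actual Betfair page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a span whose text is padded with newlines.
sample_html = '<span class="away-team-name">\nMarseille\n</span>'
soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) removes leading and trailing whitespace from the text.
names = [
    span.get_text(strip=True)
    for span in soup.find_all("span", class_="away-team-name")
]
for name in names:
    print("away team: " + name)
```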

Answer 2

Finally found the solution. The problem was the default "html.parser", which could not handle this page. Using "html5lib" as the parser instead gave the desired result.

soup=BeautifulSoup(html,"html5lib")
soup.findAll("span")

The html5lib parser parses the page the same way a browser does.
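One way to check whether the parser is the cause is to count the spans each parser finds in the same HTML. The sketch below assumes bs4 is installed; lxml and html5lib are optional and are skipped if missing. The helper name count_spans_by_parser and the sample markup are made up for illustration. On well-formed markup like this sample the counts should agree, and a difference on a real page would point at the parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_spans_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <span> tags found} for each installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The optional parser (lxml or html5lib) is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("span"))
    return counts


# Hypothetical sample markup, not the page from the question.
sample_html = "<div><span>Home</span><span>Away</span></div>"
counts = count_spans_by_parser(sample_html)
print(counts)
```

To try it on a real page, pass the bytes returned by res.read() in place of sample_html.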

Scraping spans using BeautifulSoup
