How to get the text from a span tag in BeautifulSoup

Asked: 2016-06-30 21:44:13

Tags: python web-scraping beautifulsoup python-3.4

My markup looks like this:

<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>

I am trying to get 1 GB out of it. I tried:

tt  = [a['title'] for a in soup.select(".systemRequirementsRamContent span")]
for ram in tt:
    if "RAM" in ram.split():
        print (soup.string)

The output is None.

I also tried a['text'], but it gives me a KeyError. How can I fix this, and what is my mistake?

5 answers:

Answer 0 (score: 8)

You can use a CSS selector to pull out the span you want by its title text:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>""", "html.parser")

# Select the first span whose title attribute contains "RAM".
print(soup.select_one("span[title*=RAM]").text)

This finds the span whose title attribute contains RAM; it is equivalent to writing if "RAM" in span["title"] in Python.
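The equivalence can be checked side by side. This is a minimal sketch, assuming the markup from the question (with the outer div closed) and the built-in html.parser; the variable names by_css and by_loop are only illustrative:

```python
from bs4 import BeautifulSoup

html = """<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div></div>"""

soup = BeautifulSoup(html, "html.parser")

# CSS route: attribute-substring selector.
by_css = [span.text for span in soup.select('span[title*="RAM"]')]

# Plain-Python route: keep the Tag objects and test the title yourself.
by_loop = [
    span.text
    for span in soup.find_all("span", title=True)
    if "RAM" in span["title"]
]

print(by_css, by_loop)
```

Both lists hold the text of the same span, because each route keeps the tag and reads its text rather than keeping only the title string.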

Or use find with re.compile:

import re
print(soup.find("span", title=re.compile("RAM")).text)

To get all the data:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.game-debate.com/games/index.php?g_id=21580&game=000%20Plus").content

soup = BeautifulSoup(r, "lxml")
# The RAM requirement sits in its own div.
cont = soup.select_one("div.systemRequirementsRamContent")
ram = cont.select_one("span")
print(ram["title"], ram.text)
# The remaining requirements share these two classes.
for span in soup.select("div.systemRequirementsSmallerBox.sysReqGameSmallBox span"):
    print(span["title"], span.text)

Which gives you:

000 Plus Minimum RAM Requirement 1 GB
000 Plus Minimum Operating System Requirement Win Xp 32
000 Plus Minimum Direct X Requirement DX 9
000 Plus Minimum Hard Disk Drive Space Requirement 500 MB
000 Plus GD Adjusted Operating System Requirement Win Xp 32
000 Plus GD Adjusted Direct X Requirement DX 9
000 Plus GD Adjusted Hard Disk Drive Space Requirement 500 MB
000 Plus Recommended Operating System Requirement Win Xp 32
000 Plus Recommended Hard Disk Drive Space Requirement 500 MB
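On the original error: in the question's loop, tt holds title strings rather than tags, and soup.string is None whenever the object has more than one child. a['text'] fails because square brackets look up HTML attributes, and text is not an attribute of the span. A minimal sketch of both points, assuming the markup from the question (with the outer div closed) and the built-in html.parser; text_lookup_failed is just an illustrative flag:

```python
from bs4 import BeautifulSoup

html = """<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div></div>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.select_one(".systemRequirementsRamContent span")

# .string on the whole document is None: the outer div has several children.
print(soup.string)

# Square brackets read attributes, so "title" works but "text" raises KeyError.
try:
    span["text"]
except KeyError:
    text_lookup_failed = True
else:
    text_lookup_failed = False

# The text lives on the tag itself.
print(span.text)
```

So the loop should iterate over the tags returned by select and print each tag's .text, not soup.string.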

Answer 1 (score: 2)

I tried extracting the text inside all the span tags of an HTML document with the find_all() function from bs4 (BeautifulSoup):

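from bs4 import BeautifulSoup
import requests

url = "YOUR_URL_HERE"
response = requests.get(url)
# The parser name must be a string; html5lib has to be installed separately.
soup = BeautifulSoup(response.content, "html5lib")
# A string as the second argument filters by CSS class.
spans = soup.find_all("span", "ENTER_Css_CLASS_HERE")
for span in spans:
    print(span.text)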
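The class filter can be tried without a network request. The following is only a sketch: the markup and the class name req are made up for illustration, and it uses the built-in html.parser instead of html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "req" class is invented for this example.
html = """
<div>
  <span class="req">1 GB</span>
  <span class="req">DX 9</span>
  <span class="other">ignored</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A string passed as the second positional argument is treated as a class filter.
positional = [s.text for s in soup.find_all("span", "req")]

# The keyword form says the same thing more explicitly.
keyword = [s.text for s in soup.find_all("span", class_="req")]

print(positional)
```

The class_ keyword is usually easier to read than the positional form, since it makes clear that the string is a class name and not a tag name.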

Answer 2 (score: 0)

You can search on the span tag alone in BeautifulSoup, or combine it with other attributes such as class or title.

from bs4 import BeautifulSoup as BSHTML

htmlText = """<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>"""

soup = BSHTML(htmlText, "html.parser")
spans = soup.find_all('span')
# spans = soup.find_all('span', attrs={'class': 'your-class-name'})  # or span by class name
# spans = soup.find_all('span', attrs={'title': '000 Plus Minimum RAM Requirement'})  # or span with a title
for span in spans:
    print(span.text)

Answer 3 (score: 0)

After iterating over all the tags in the folder, use contents[0].
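Assuming this answer refers to the contents list that every BeautifulSoup tag exposes, a minimal sketch on the span from the question might look like this (html.parser and the name first_child are choices made for the example):

```python
from bs4 import BeautifulSoup

html = '<span title="000 Plus Minimum RAM Requirement">1 GB</span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span")

# .contents is the list of direct children; here it holds a single text node.
first_child = span.contents[0]
print(first_child)
```

This only works when the first child is the text you want; for tags with nested markup, .text or get_text() is the safer choice.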

Answer 4 (score: 0)

You can solve this in a few lines with gazpacho:

from gazpacho import Soup

html = """\
<div class="systemRequirementsMainBox">
<div class="systemRequirementsRamContent">
<span title="000 Plus Minimum RAM Requirement">1 GB</span> </div>
"""

soup = Soup(html)
soup.find("span", {"title": "Minimum RAM Requirement"}).text
# '1 GB'