Problem getting text from a span tag

Asked: 2016-07-02 00:01:25

Tags: python html web-scraping beautifulsoup python-3.4

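On this link, I want to get the text from the span tag with the r_compare_bars_value class. If you search the page for that class, you will see text like 104 (min: 88) fps, and I only want the min: 88 part. My code: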

from bs4 import BeautifulSoup
import urllib.request
import requests
r = urllib.request.urlopen('http://www.notebookcheck.net/Computer-Games-on-Laptop-Graphics-Cards.13849.0.html').read()
soup = BeautifulSoup(r)

links = [a['href'] for a in soup.select(".gpugames_header_games > a")]

for url in links:
    if url != "":
        print (url)
        rr = requests.get(url).content
        soup = BeautifulSoup(rr,"html.parser")

        for aa in soup.select("div.r_compare_bars_value span"):
            print (aa)
            if "min:" in aa.text:
                print (aa.text)

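But it prints nothing now. With other classes it printed strings, but never the min: 88 part. I also tried div.tx-nbc2fe-pi1, and I tried the selector without the span tag. The markup on that site is really messy. Where is my mistake, and how can I fix it?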

1 Answer:

Answer 0: (score: 0)

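You cannot do this without manipulating the returned text with split, strip, and so on. Also, r_compare_bars_value is actually the class of a span, not of a div, so soup.select("span.r_compare_bars_value") is the correct selector.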
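To illustrate the selector difference, here is a minimal sketch run against a made-up HTML fragment. The fragment and the r_compare_bars wrapper class are assumptions for illustration only; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup may differ.
html = """
<div class="r_compare_bars">
  <span class="r_compare_bars_value">104 (min: 88) fps</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Looks for a span nested inside a div that carries the class: no match here.
nested = soup.select("div.r_compare_bars_value span")

# Looks for a span that carries the class itself: one match.
direct = soup.select("span.r_compare_bars_value")

print(len(nested), len(direct))
print(direct[0].get_text(strip=True))
```

The first selector only matches when a div has the class and a span sits somewhere inside it, which is why the loop in the question never ran.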

Pulling out the (min: ...) part is actually a good use case for a regular expression:

from bs4 import BeautifulSoup
import requests
import re

# Raw string so the backslashes reach the regex engine unchanged
mn = re.compile(r"\(min:.*?\)")

r = requests.get('http://www.notebookcheck.net/Computer-Games-on-Laptop-Graphics-Cards.13849.0.html').content
soup = BeautifulSoup(r, "lxml")

# hrefs of the links in the games header
links = (a["href"] for a in soup.select(".gpugames_header_games > a"))

for url in links:
    if url:  # skip empty hrefs
        rr = requests.get(url).content
        soup = BeautifulSoup(rr, "html.parser")
        # The class is on the span itself
        for aa in soup.select("span.r_compare_bars_value"):
            m = mn.search(aa.text)
            if m:
                print(m.group())

Running the above on a few of the URLs gives:

(min: 88)
(min: 164)
(min: 251)
(min: 281)
(min: 283)
(min: 291)
(min: 75)
(min: 129)
(min: 202)
(min: 64)
(min: 94)
(min: 178)
(min: 53)
(min: 97)
(min: 154)
(min: 199)
(min: 289)
(min: 296)
(min: 55)
(min: 78)
(min: 39)
(min: 57)
(min: 109)
(min: 153)
(min: 200)
(min: 216)
(min: 39)
(min: 59)
(min: 110)
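If only the number is wanted rather than the whole (min: ...) string, a capture group is one option. This is a small sketch, not part of the code above: the helper name min_fps is hypothetical, and it assumes the value is always a whole number.

```python
import re

# Capture only the digits inside "(min: N)"; \s* tolerates optional spaces.
MIN_FPS = re.compile(r"\(min:\s*(\d+)\)")


def min_fps(text):
    """Return the minimum fps as an int, or None if there is no (min: N) part."""
    m = MIN_FPS.search(text)
    return int(m.group(1)) if m else None


print(min_fps("104 (min: 88) fps"))  # 88
print(min_fps("104 fps"))            # None
```

Inside the scraping loop, min_fps(aa.text) could replace the mn.search call, with None values skipped.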