Extract only the text outside script tags from HTML using BeautifulSoup

Asked: 2018-12-10 07:15:51

Tags: python python-3.x beautifulsoup urllib3

I have HTML like this:

<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>

I am trying to extract "Ages 15" using BeautifulSoup.

So I wrote the following Python code.

Code:

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'

http = urllib3.PoolManager()

page = http.request('GET', URL)

soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})

print(age.text)

Output:

Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);

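I only want "Ages 15", not the function call inside the script tag. Is there a way to get just the text "Ages 15", or any way to exclude the contents of the script tag?

PS: There are many script tags and the URLs differ, so I would rather not strip the unwanted text out of the output by string replacement.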

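2 answers:

Answer 0: (score: 2)

Late answer, but for future reference: you can also use decompose() to remove all script elements from the HTML before reading the text, i.e.: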

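from bs4 import BeautifulSoup

# html is the markup from the question, held in a string
soup = BeautifulSoup(html, "html.parser")
# remove script and style elements from the parse tree
for script in soup(["script", "style"]):
    script.decompose()
print(soup.find("span", {"class": "age"}).text.strip())
# Ages 15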
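
Note that decompose() modifies the parse tree, so the script elements are gone for any later lookups. If the tree has to stay intact, one alternative is to walk the text nodes and skip those whose parent is a script or style tag. The following is only a sketch of that idea: the helper name text_without_scripts and the SKIPPED_PARENTS set are made up for illustration, and string=True assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Hypothetical helper, not part of BeautifulSoup itself.
SKIPPED_PARENTS = {"script", "style"}


def text_without_scripts(tag):
    """Join the text nodes under `tag`, skipping script/style contents."""
    parts = []
    for node in tag.find_all(string=True):
        if node.parent.name in SKIPPED_PARENTS:
            continue  # text that belongs to a <script> or <style> tag
        stripped = node.strip()
        if stripped:  # ignore whitespace-only nodes
            parts.append(stripped)
    return " ".join(parts)


html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})
print(text_without_scripts(age))  # Ages 15
print(soup.find("script") is not None)  # True: the tree was not modified
```

The trade-off is a little more code in exchange for leaving the soup object untouched, which matters if the script tags are needed later.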

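Answer 1: (score: 1)

Use .find(text=True), which returns the first text node inside the tag (Beautiful Soup 4.4.0 and later also accept this argument under the name string).

Example: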

from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())

Output:

Ages 15
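
One caveat: .find(text=True) stops at the first text node, so any text that follows a nested tag is lost. A possible variation, sketched below, joins all text nodes that are direct children of the span by passing recursive=False to find_all(). The markup here is a hypothetical variant of the question's HTML with extra text ("and up") after the script tag, added only to make the difference visible; string=True assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's markup: text also follows the script.
html = (
    '<span class="age">Ages 15 '
    '<span class="loc" id="loc_loads1"></span>'
    '<script>getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);</script>'
    ' and up</span>'
)

soup = BeautifulSoup(html, "html.parser")
age = soup.find("span", {"class": "age"})

# Only the first text node under the span.
first_text = age.find(string=True).strip()

# All text nodes that are direct children of the span; text nested in
# child tags (including the script) is not visited at all.
direct_nodes = age.find_all(string=True, recursive=False)
direct_text = " ".join("".join(direct_nodes).split())  # collapse whitespace

print(first_text)   # Ages 15
print(direct_text)  # Ages 15 and up
```

This only collects text that sits directly inside the matched tag, so it also drops text from nested tags that are not scripts; whether that is acceptable depends on the page.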