Python BeautifulSoup: scraping spans and p tags inside divs, and how to get an exact match on a div class name

Time: 2018-10-31 13:46:35

Tags: python html beautifulsoup

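I want to scrape two divs that share the same class name (but I don't want the other divs on the page whose class name only partially matches). From the first div, I only need the text inside each span element. From the second, I need the text inside the span element for the first line, and then the text inside the <p> tags for the second and third lines.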

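I'm not even sure why I need the slice at the end of divs (I think it's because the div class col returns more than the 2 relevant divs, but adding [:1] to the end of divs seems to help).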

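My questions are: how do I get an exact match on the div class name, how do I scrape the text inside the p tags, and how do I combine the results? I can get the text inside the span tags as shown below, but as I said above, I also need the text inside the p tags, merged with the span results.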

The data comes from the player details section at this URL: https://www.skysports.com/football/player/141016/alisson-ramses-becker

The HTML looks like this

    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>                <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>                <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>               
                        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>                <p>Position: Goal Keeper</p>
        </div>
    </div>

The relevant part of my program

        premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
        premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})

        divs = player_soup.find_all( 'div', {'class': 'col'})
        for div in divs[:1]:
            para = div.find_all('p')
            print(para)

Output:

    [<p class="text-h4 title">Player Details</p>, <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>, <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>, <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>, <p>Club: <span itemprop="affiliation">Liverpool</span></p>, <p>Squad: 13</p>, <p>Position: Goal Keeper</p>]                               

Also, I know I can get the span text with this

divs = player_soup.find_all( 'div', {'class': 'col'})
for div in divs[:1]:
    spans = div.find_all('span')
    for span in spans:       
        print(span.text, ",", end=' ')

Output:

Alisson Ramses Becker , 02/10/1992 ,  Brazil , Liverpool ,              

2 Answers:

Answer 0: (score: 1)

Your main question is how to extract the text from a <p> that is not wrapped in a <span>.

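A NavigableString corresponds to a bit of text within a tag. So you can iterate over the children of each <p> and keep only those that are instances of NavigableString:
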
from bs4 import BeautifulSoup,NavigableString
html = "your example"

soup = BeautifulSoup(html,"lxml")
for e in soup.find("p"):
    print(e,type(e))
#Name:  <class 'bs4.element.NavigableString'>
#<strong><span itemprop="name">Alisson Ramses Becker</span></strong> <class 'bs4.element.Tag'>

The actual code:

resultset = soup.find_all("p")
maintext = []
for result in resultset:
    for element in result:
        if isinstance(element, NavigableString):
            maintext.append(element)

print(maintext)
# ['Name: ', 'Date of birth:', 'Place of birth:', 'Club: ', 'Squad: 13', 'Position: Goal Keeper']

which is equivalent to

[element for result in resultset for element in result if isinstance(element, NavigableString)]
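
As a possible alternative to the isinstance check, Beautiful Soup can be asked for the direct text nodes of each <p> by combining string=True with recursive=False in find_all. This is only a minimal sketch on a fragment of the question's markup; it uses the built-in "html.parser" instead of lxml purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Fragment of the markup from the question. "html.parser" is an assumption
# made here so the sketch runs without lxml installed.
html = """
<div class="col">
  <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>
  <p>Position: Goal Keeper</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

direct_text = []
for p in soup.find_all("p"):
    # string=True matches text nodes only; recursive=False restricts the
    # search to direct children of <p>, so text inside <span> is skipped.
    direct_text.extend(str(s) for s in p.find_all(string=True, recursive=False))

print(direct_text)
# ['Club: ', 'Squad: 13', 'Position: Goal Keeper']
```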

My complete test code

from bs4 import BeautifulSoup,NavigableString
html = """

    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>                <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>                <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>               
                        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>                <p>Position: Goal Keeper</p>
        </div>
    </div>
"""
soup = BeautifulSoup(html,"lxml")
resultset = soup.find_all("p")
fr = [element for result in resultset for element in result if isinstance(element, NavigableString)]
spanset = [e.text for e in soup.find_all("span",{"itemprop":True})]
setA = ["".join(z) for z in zip(fr,spanset)]
final = setA + fr[len(spanset):]
print(final)

Output

['Name: Alisson Ramses Becker', 'Date of birth:02/10/1992', 'Place of birth: Brazil', 'Club: Liverpool', 'Squad: 13', 'Position: Goal Keeper']
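
Note that the zip-based merge above only works while the spans line up, in order, with the first few paragraphs. A sketch that avoids that assumption is to read each <p> as a whole with get_text() and split it at the first colon. The details dictionary and the "label: value" shape are assumptions made for illustration, and "html.parser" is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """
<div class="row-table details -bp30">
    <div class="col">
        <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>
        <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
        <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
    </div>
    <div class="col">
        <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>
        <p>Position: Goal Keeper</p>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

details = {}
for p in soup.find_all("p"):
    # get_text() joins the paragraph's own text with the text of nested tags.
    label, sep, value = p.get_text().partition(":")
    if sep:  # skip paragraphs that do not look like "label: value"
        details[label.strip()] = value.strip()

print(details)
```

Each paragraph is handled on its own, so it no longer matters which paragraphs contain a span.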

Answer 1: (score: 1)

Assuming you have permission to scrape this site, and there is no API or JSON response available, one slower approach is:

from bs4 import BeautifulSoup as bs

html = '''
 <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>                <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>                <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>               
                        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>                <p>Position: Goal Keeper</p>
        </div>
    </div>
'''

soup = bs(html,'html5lib')

data = [d.find_all('p') for d in soup.find_all('div',{'class':'col'})]

value = []
for i in data:
    for j in i:
        value.append(j.text)

print(value)
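
On the exact-match part of the question: searching with {'class': 'col'} matches any div that has col among its classes, so a div such as class="col span2" is returned as well. The sketch below shows two possible ways around that: comparing the whole class list, or searching only the direct children of the details container. The outer "col span2" wrapper is invented here to mimic a partially matching class name; the real page's surrounding markup is not shown in the question, so treat it as an assumption. The helper is_exact_col is likewise a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the outer "col span2" wrapper is invented for this
# sketch; the inner block is abbreviated from the question.
html = """
<div class="col span2">
  <p class="text-h4 title">Player Details</p>
  <div class="row-table details -bp30">
    <div class="col"><p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p></div>
    <div class="col"><p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def is_exact_col(tag):
    # class is a multi-valued attribute, so bs4 exposes it as a list.
    return tag.name == "div" and tag.get("class") == ["col"]


loose = soup.find_all("div", {"class": "col"})  # also matches "col span2"
exact = soup.find_all(is_exact_col)             # only class="col"

# Alternative: look only at the direct children of the details container.
container = soup.find("div", class_="row-table")
scoped = container.find_all("div", class_="col", recursive=False)

print(len(loose), len(exact), len(scoped))
texts = [p.get_text() for div in exact for p in div.find_all("p")]
print(texts)
```

With either variant the trailing slice on divs should no longer be needed, since the outer wrapper is never selected.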