How do I select all siblings in a div that are not enclosed in a tag with BeautifulSoup?

Time: 2017-11-29 15:26:53

Tags: python beautifulsoup

How can I use beautifulsoup to select, for every div.title, the first siblings that are not enclosed in a tag?

In the example below, I need to retrieve:

[Text I care about which <b>can</b> have formatting..., 
 Text I care about., 
 Text I care about <span class='someclass'>which can be in a span</span>...]

Example:

<div class="level1">
    <div class="title">
        Title I do not care about
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        Text I care about which <b>can</b> have formatting...
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about.
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about <span class='someclass'>which can be in a span</span>...
        </div>
    </div>
</div>

Note that I need to modify the text at specific places with some regular expressions, so I need the whole text including the formatting tags (b, br, etc.).

2 answers:

Answer 0: (score: 0)

You can use the bs4 extract() method to remove the unwanted elements from each item returned by find_all.

For example:

import bs4

# texthere is the HTML string to parse
soup = bs4.BeautifulSoup(texthere, "html.parser")
divs = soup.find_all("div", {"class": "level3"})  # Find all level3 divs
for div in divs:
    title = div.find("div", {"class": "title"})  # Find the title within each div
    if title is not None:
        title.extract()  # Remove that title from the div
    print(div.text)  # Here I print div.text, but you can repurpose this for whatever you need

Here is a good source on SO: Exclude unwanted tag on Beautifulsoup Python

Hope it helps!
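Since the question asks for the text with its formatting tags, here is a minimal sketch of the same extract() idea that compares div.get_text() with div.decode_contents(). The sample HTML and the variable names (html, plain, markup) are made up for illustration and are not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one div with a title and some formatted text.
html = """
<div class="level3">
    <div class="title">Title I do not care about</div>
    Text I care about <b>with</b> formatting.
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="level3")

# Remove the title so that only the remaining content is left in the div.
title = div.find("div", class_="title")
if title is not None:
    title.extract()

plain = div.get_text().strip()          # text only, inline tags are dropped
markup = div.decode_contents().strip()  # inner HTML, inline tags are kept

print(plain)
print(markup)
```

The decode_contents() result keeps the <b> tag, so it is probably the better starting point if the text is going to be edited with regular expressions afterwards.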

Answer 1: (score: 0)

`from bs4 import BeautifulSoup

strn =""" 
<div class="level1">
    <div class="title">
        Title I do not care about
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        Text I care about which <b>can</b> have formatting...
    </div>
    <div class="level2">
        <div class="title">
            Title I do not care about
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about. 
        </div>
        <div class="level3">
            <div class="title">
                Title I do not care about
            </div>
            Text I care about <span class='someclass'>which can be in a span</span>...
        </div>
    </div>
</div> """



soup = BeautifulSoup(strn, 'html.parser')

the_divs = soup.find_all('div', class_='title')
for the_div in the_divs:
    # Walk every child of the title's parent and skip the nested divs
    for the_sibling in the_div.parent.contents:
        if the_sibling.name != 'div':
            print(the_sibling.string)
`

Use the 'the_sibling' variable here to build whatever string you need; for example, 'str(the_sibling)' returns the text including its tags (your or
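One way to turn that idea into the list the question asks for is sketched below. It assumes the text always follows its div.title inside the same parent, and it joins str() of every following sibling that is not a div. The function name text_after_titles and the shortened sample are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened version of the sample from the question.
html = """
<div class="level1">
    <div class="title">Title I do not care about</div>
    <div class="level2">
        <div class="title">Title I do not care about</div>
        Text I care about which <b>can</b> have formatting...
    </div>
    <div class="level2">
        <div class="title">Title I do not care about</div>
        <div class="level3">
            <div class="title">Title I do not care about</div>
            Text I care about.
        </div>
        <div class="level3">
            <div class="title">Title I do not care about</div>
            Text I care about <span class='someclass'>which can be in a span</span>...
        </div>
    </div>
</div>
"""


def text_after_titles(document):
    """Return, for each div.title, the markup of its non-div following siblings."""
    soup = BeautifulSoup(document, "html.parser")
    results = []
    for title in soup.find_all("div", class_="title"):
        parts = []
        for sibling in title.next_siblings:
            # Nested divs have their own title and are handled on their own turn.
            if isinstance(sibling, Tag) and sibling.name == "div":
                continue
            parts.append(str(sibling))  # str() keeps inline tags such as <b>
        fragment = "".join(parts).strip()
        if fragment:  # skip titles followed only by whitespace and divs
            results.append(fragment)
    return results


fragments = text_after_titles(html)
for fragment in fragments:
    print(fragment)
```

Each entry in fragments is a plain string, so re.sub or any other string processing can be applied to it directly.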