Question

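I would like BeautifulSoup to do the following.

I have HTML files that I want to modify. I am particularly interested in two tags. The first, which I will call TagA, is

<div class="A">...</div>

and the second, which I will call TagB, is

<p class="B">...</p>

Both tags occur independently throughout the HTML, and they may contain other tags and be nested inside other tags. I want to put a marker tag around every TagA that is not immediately followed by a TagB, so that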

<div class="A">...</div> becomes <marker><div class="A">...</div></marker>

But when a TagB immediately follows a TagA, I want the marker tag to surround both of them,

so that

<div class="A">...</div><p class="B">...</p>
becomes
<marker><div class="A">...</div><p class="B">...</p></marker>

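I can see how to select TagA and wrap it in the marker tag, but when it is followed by a TagB I do not know whether, or how, the BeautifulSoup 'selection' can be extended to include the next sibling. Any help is appreciated.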

Answer 1

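BeautifulSoup does have a "next sibling" feature. Find all tags with class A, and for each one use a.next_sibling to check whether the node that follows is a B.

See the documentation: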

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
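
As a rough sketch of that check (not the answerer's own code): next_sibling can return a whitespace-only text node instead of a tag, so the hypothetical helper next_tag_sibling below skips those first. The sample HTML and the choice of the html.parser backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup, assumed for illustration only
html = (
    '<div class="A">one</div>\n<p class="B">two</p>'
    '<div class="A">three</div><span>x</span>'
)
soup = BeautifulSoup(html, "html.parser")


def next_tag_sibling(tag):
    """Return the next sibling, skipping whitespace-only text nodes."""
    node = tag.next_sibling
    while isinstance(node, NavigableString) and not node.strip():
        node = node.next_sibling
    return node


results = []
for a in soup.find_all("div", class_="A"):
    sibling = next_tag_sibling(a)
    # True only when the following node is <p class="B">
    followed_by_b = (
        isinstance(sibling, Tag)
        and sibling.name == "p"
        and "B" in sibling.get("class", [])
    )
    results.append((a.get_text(), followed_by_b))

print(results)  # [('one', True), ('three', False)]
```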

Answer 2

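I think I was going about this the wrong way by trying to extend the 'selection' from one tag to the one that follows it. Instead, I found that the code below does the job: it inserts the outer 'Marker' tag and then moves the A tag (and the B tag, if present) into it. I am quite new to Python, so I would greatly appreciate suggestions for improvements or warnings about pitfalls.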

from bs4 import BeautifulSoup, Tag


def is_tag_b(node):
    """Return True only if node is <p class="B">; strings and None give False."""
    return (
        isinstance(node, Tag)
        and node.name == "p"
        and "B" in node.get("class", [])
    )


soup = BeautifulSoup(
    """<div class = "A"><p><i>more content</i></p></div><div class = "A"><p><i>hello content</i></p></div><p class="B">da <i>de</i> da </p><div class = "fred">not content</div>""",
    "html.parser",
)

for tag_a in soup.find_all("div", "A"):
    marker = soup.new_tag("Marker")
    next_node = tag_a.next_sibling
    # Skip over whitespace-only text nodes between the tags
    while next_node is not None and str(next_node).isspace():
        next_node = next_node.next_sibling
    tag_a.replace_with(marker)  # put the marker where the A element was
    marker.append(tag_a)  # then move the A element inside the marker
    if is_tag_b(next_node):
        marker.append(next_node)  # append() also removes B from its old position

print(soup)
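
One possible tidy-up, offered as a sketch rather than a definitive version: bs4's Tag.wrap() replaces an element with a wrapper and moves the element inside it in a single call. The function name wrap_a_and_b, the marker_name parameter, and the demo markup are all hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def wrap_a_and_b(soup, marker_name="marker"):
    """Wrap each <div class="A">, plus an immediately following <p class="B">."""
    for tag_a in soup.find_all("div", class_="A"):
        # Look at the next sibling, ignoring whitespace-only text nodes
        node = tag_a.next_sibling
        while isinstance(node, NavigableString) and not node.strip():
            node = node.next_sibling
        # wrap() puts the new tag where tag_a was and returns the wrapper
        marker = tag_a.wrap(soup.new_tag(marker_name))
        if (
            isinstance(node, Tag)
            and node.name == "p"
            and "B" in node.get("class", [])
        ):
            marker.append(node)  # moves B inside the marker
    return soup


demo = BeautifulSoup(
    '<div class="A">1</div><div class="A">2</div>'
    '<p class="B">3</p><p class="B">4</p>',
    "html.parser",
)
wrapped = str(wrap_a_and_b(demo))
print(wrapped)
```

In this demo only the second A is followed directly by a B, so only that pair shares a marker; the trailing B that follows another B is left untouched.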

Answer 3

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://ursite.com")  # returns the HTML response
soup = BeautifulSoup(html, "html.parser")

# attrs is a dict of attribute filters, e.g. attrs={"class": "class", "id": "1234"}
all_div = soup.find_all("div", attrs={})

single_div = all_div[0]

# find the first <p> tag inside single_div
p_tag_obj = single_div.find("p")

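To move around the document you can use obj.findNext(), obj.findAllNext(), obj.findAllPrevious(), and obj.findPrevious() (named find_next(), find_all_next(), find_all_previous(), and find_previous() in bs4). To read attributes, use obj.get("href"), obj.get("title"), and so on.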
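
A small illustration of those calls, using the bs4 spellings and made-up sample markup (the HTML string and variable names are assumptions, not part of the answer):

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration only
html = (
    '<div class="A"><a href="/x" title="first">link</a></div>'
    '<p class="B">text</p>'
)
soup = BeautifulSoup(html, "html.parser")

div_a = soup.find("div", class_="A")

# find_next() searches everything after div_a in document order
next_p = div_a.find_next("p")

# find_previous() walks backwards from the <p> to the nearest <div>
prev_div = next_p.find_previous("div")

# get() returns the attribute value, or None if the attribute is absent
link = div_a.find("a")
href = link.get("href")
title = link.get("title")
missing = link.get("id")

print(next_p.get_text(), href, title, missing)
```

Note that find_next() does not require the match to be adjacent, so on its own it cannot tell whether a B immediately follows an A; checking next_sibling is still needed for that.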

Extending the selection with BeautifulSoup
