I want to extract all instances of a given tag that lie between two other tags. I am currently working with BeautifulSoup. Here is an example:
<p class='x' id = '1'> some content 1 </p>
<p class='y' id = 'a'> some content a </p>
<p class='y' id = 'b'> some content b </p>
<p class='y' id = 'c'> some content c </p>
<p class='potentially some other class'> </p>
<p class='x' id = '2'> some content 2 </p>
<p class='y' id = 'd'> some content d </p>
<p class='y' id = 'e'> some content e </p>
<p class='y' id = 'f'> some content f </p>
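I am interested in selecting all instances of class 'y' that lie between two tags of class 'x', where the two 'x' tags have different ids. In this specific example, I want to select every p with class='y' between them and retrieve its text. The output I ultimately want is: "some content a", "some content b" and "some content c".
I tried the findAllNext method, but it gives me "some content a", "some content b", "some content c" and also "some content d", "some content e", "some content f".
Here is my code: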
par = BeautifulSoup(HTML_CODE, 'lxml')
loc = par.find('p', class_ = 'x', id ='1')
desired = loc.findAllNext('p', class_ = 'y')
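Is there a way to avoid also selecting the class='y' instances that appear after the class='x' tag with id='2'?
Thanks.
Answer 0 (score: 2)
You can iterate from the desired starting position and stop as soon as the end tag is reached.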
from bs4 import BeautifulSoup
html = """
<p class='x' id = '1'> some content 1 </p>
<p class='y' id = 'a'> some content a </p>
<p class='y' id = 'b'> some content b </p>
<p class='y' id = 'c'> some content c </p>
<p class='potentially some other class1'> potentially some other class 1 </p>
<p class='potentially some other class2'> potentially some other class 2</p>
<p class='potentially some other class3'> potentially some other class 3 </p>
<p class='x' id = '2'> some content 2 </p>
<p class='y' id = 'd'> some content d </p>
<p class='y' id = 'e'> some content e </p>
<p class='y' id = 'f'> some content f </p>
"""
soup = BeautifulSoup(html,"lxml")
start = soup.find("p",class_="y",id="c")
end = soup.find("p",class_="x",id="2")
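def next_ele(ele, result=None):
    # Use None instead of a mutable default so repeated calls do not share one list.
    if result is None:
        result = []
    row = ele.find_next("p")
    # Stop at the end of the document or at the end marker itself.
    if row is None or row is end:
        return result
    result.append(row)
    return next_ele(row, result)

# Prints the <p> tags that follow `start`, up to but excluding `end`.
print(next_ele(start))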
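
To get exactly what the question asks for (the text of the class 'y' paragraphs between two class 'x' markers), the same start-to-end walk can be combined with a class filter. The following is only a minimal sketch under a few assumptions: `texts_between` and its parameters are hypothetical names that do not appear in the original posts, the sample HTML is shortened, and `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<p class='x' id='1'> some content 1 </p>
<p class='y' id='a'> some content a </p>
<p class='y' id='b'> some content b </p>
<p class='y' id='c'> some content c </p>
<p class='other'> something else </p>
<p class='x' id='2'> some content 2 </p>
<p class='y' id='d'> some content d </p>
"""


def texts_between(soup, start_id, wanted_class="y", stop_class="x"):
    """Hypothetical helper: collect the text of wanted_class <p> tags that
    follow the stop_class <p> with id=start_id, up to the next stop_class <p>."""
    start = soup.find("p", class_=stop_class, id=start_id)
    if start is None:
        return []
    found = []
    # find_all_next walks every later <p> in document order.
    for tag in start.find_all_next("p"):
        classes = tag.get("class", [])  # bs4 returns class as a list
        if stop_class in classes:
            break  # reached the next marker, so stop collecting
        if wanted_class in classes:
            found.append(tag.get_text(strip=True))
    return found


soup = BeautifulSoup(html, "html.parser")
print(texts_between(soup, "1"))
```

A plain loop is used here instead of recursion, so long documents cannot hit Python's recursion limit, and the early `break` is what keeps the paragraphs after id='2' out of the result.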