Question

I want to use BeautifulSoup to extract the text from paragraph elements. The HTML looks like this:

<span class="span_class">
 <h1>heading1</h1>
 <p>para1</p>
 <h1>heading 2</h1>
 <p>para2</p>
</span>

I want to extract the text from the first p only if the h1 before it exists, and likewise for the following pairs. So far I have tried

x=soup.findAll('span',{'class':'span_class'})
y=x.findAll('p')[0].text

but it does not work.

Answer 1

You can use the CSS adjacent sibling selector here:

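# `x` must be a single Tag; findAll returns a ResultSet, which has no usable .select()
x = soup.find('span', {'class': 'span_class'})
paragraphs = x.select('h1 + p')
# `paragraphs` now contains two elements: <p>para1</p> and <p>para2</p>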

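This selects only those P elements whose immediately preceding sibling element is an H1. If you want to apply more logic based on the H1 content, you can do this: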

for p in x.select('h1:first-child + p'):
    # `p` is the element that has an `H1` directly before it.
    # `p.previous_sibling` may be a whitespace text node, so look up the `H1` explicitly.
    h1 = p.find_previous_sibling('h1')
    if h1.text == 'heading1':
        # We got the `P` that has `H1` with content `"heading1"` before it.
        print(p, h1)
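
To make the selector approach easy to try, here is a minimal self-contained sketch that runs `h1 + p` against the HTML from the question. The variable names `span` and `texts` and the choice of the built-in `html.parser` are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# HTML from the question, with the class attribute properly quoted.
html = """
<span class="span_class">
 <h1>heading1</h1>
 <p>para1</p>
 <h1>heading 2</h1>
 <p>para2</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (findAll/find_all would return a list-like ResultSet).
span = soup.find("span", {"class": "span_class"})

# 'h1 + p' matches a <p> only when the element directly before it is an <h1>.
texts = [p.get_text() for p in span.select("h1 + p")]
print(texts)  # ['para1', 'para2']
```

A `<p>` that follows another `<p>` rather than an `<h1>` is not matched, which is what "only if the h1 exists" asks for.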

Answer 2

from bs4 import BeautifulSoup as bs

html = '''<html>
<body>
<span class='span_class'>
<h1>heading1</h1>
<p>content1</p>
<p>content2</p>
<h1>heading2</h1>
<p>content3</p>
</span>
</body>
</html>'''
soup = bs(html, 'lxml')
x = soup.find_all('span', {'class': 'span_class'})  # find every matching span
try:
    for y in x:
        heading = y.find_all('h1')  # find every h1 inside the span
        for something in heading:  # runs only if at least one h1 exists
            if something.text == 'heading1':
                print(something.text)  # print the h1 text
                # find_next returns None (it does not raise) when no <p> follows
                p = something.find_next('p')
                if p is not None:
                    print(p)
            # any h1 other than 'heading1' is skipped
except Exception as e:
    print(e)

Is this what you are looking for? It looks for your <span> and then tries to find an <h1> inside it. If an <h1> is in the <span>, it looks for the next <p>.
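
One caveat with `find_next('p')` is that it searches the rest of the document, so a heading with no paragraph of its own would pick up a paragraph belonging to a later heading. The sketch below is one possible variant that only accepts a `<p>` that is the very next sibling element of the `<h1>`. The helper name `paragraph_after`, the extra `heading3` in the sample HTML, and the use of `html.parser` are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Sample HTML; heading2 deliberately has no paragraph of its own.
html = """
<span class="span_class">
<h1>heading1</h1>
<p>content1</p>
<p>content2</p>
<h1>heading2</h1>
<h1>heading3</h1>
<p>content3</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")


def paragraph_after(h1):
    """Return the text of the <p> directly following h1, or None (hypothetical helper)."""
    nxt = h1.find_next_sibling()  # next sibling *element*, whitespace text is skipped
    if nxt is not None and nxt.name == "p":
        return nxt.get_text()
    return None


# Map each heading's text to the text of its own first paragraph.
pairs = {
    h1.get_text(): paragraph_after(h1)
    for h1 in soup.select("span.span_class h1")
}
print(pairs)
# {'heading1': 'content1', 'heading2': None, 'heading3': 'content3'}
```

Here `heading2` maps to `None` instead of borrowing `content3`, which `find_next('p')` would have returned.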

Extract text from p only if a preceding heading exists, using BeautifulSoup
