Question

I want to extract paragraph data based on the h2 class value. Here is the HTML:

<div class="myClass">
<div itemprop="reviewBody" class="review-body">
<h2 class="h3">Test1</h2><p>I want to extract this</p>
<h2 class="h3">Test2</h2><p>Dont want to extract</p>
<h2 class="h3">Test3</h2><p>I want to extract this too</p>
</div>
</div>

The output should be:

Test 1    | I want to extract this
Test 3    | I want to extract this too

Below is my code, but it extracts all of the headings (Test1, Test2, Test3). How can I extract data based on the h2 text?

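from bs4 import BeautifulSoup as bs

# page is the HTTP response fetched earlier (not shown here)
soup = bs(page.text, 'html.parser')
divs = soup.find_all(class_="myClass")

test1 = []

# collects the text of every h2.h3 heading, with no filtering by text
for item in divs[0].find_all('h2', class_="h3"):
    test1.append(item.text.strip())
print(test1)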

Answer 1

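If I understand correctly, you want to add an extra condition on the h2 text. You can use the text argument of .find_all(), which can take a list of strings to match (newer Beautiful Soup versions call this argument string), for example:

for h2 in soup.find_all('h2', class_='h3', text=['Test1', 'Test3']):
    print(h2.get_text())

If you also want the paragraph that follows each heading, you can call find_next_sibling() on each matched h2.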
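
Combining the two ideas, here is a minimal sketch of how the heading filter and find_next_sibling() might fit together. It assumes the HTML from the question is available as a string (the html variable below stands in for page.text), and the names wanted and rows are made up for illustration. It uses string=, which is the current name for the text= argument.

```python
from bs4 import BeautifulSoup

# Stand-in for page.text: the HTML snippet from the question.
html = """
<div class="myClass">
<div itemprop="reviewBody" class="review-body">
<h2 class="h3">Test1</h2><p>I want to extract this</p>
<h2 class="h3">Test2</h2><p>Dont want to extract</p>
<h2 class="h3">Test3</h2><p>I want to extract this too</p>
</div>
</div>
"""

# Heading texts to keep (illustrative name, values taken from the question).
wanted = ["Test1", "Test3"]

soup = BeautifulSoup(html, "html.parser")

rows = []
for h2 in soup.find_all("h2", class_="h3", string=wanted):
    # First <p> sibling that comes after this heading, or None if there is none.
    p = h2.find_next_sibling("p")
    if p is not None:
        rows.append((h2.get_text(strip=True), p.get_text(strip=True)))

for heading, paragraph in rows:
    print(f"{heading} | {paragraph}")
```

Headings without a following p are skipped here. Whether that is the right behavior depends on the real page.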
