<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content">
<p>Yellow is my favorite color.</p>
<p> </p>
<p>I love Banana.</p>
</div>
</div>
</div>
</section>
<section id="benefit-two-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content"><p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div>
</div>
</section>
</div>
I know how to extract all the text I want from this HTML. Here is my code:
for item in soup.find('div', {'class': 'michelinKeyBenefitsComp'}):
    try:
        for tex in item.find_all('div', {'class': 'col'}):
            print(tex.text)
    except:
        pass
But what I want to do is extract the contents separately, so I can save each one on its own. The expected result is as follows:
Banana is yellow.
Yellow is my favorite color.
I love Banana.
#save first
Apple is red.
Red is not my favorite color.
I don't like apple.
#save next
In this case there are only 2 paragraphs, but in other cases there may be 3 or more. How can I extract them without knowing how many paragraphs there are? TIA
Answer 0 (score: 1)
Maybe you should try extracting the text this way: your divs have unique ids, but to select the text of each section you can use the classes to pick the text out of the specific div correctly.
from bs4 import BeautifulSoup
text = """
<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content">
<p>Yellow is my favorite color.</p>
<p> </p>
<p>I love Banana.</p>
</div>
</div>
</div>
</section>
<section id="benefit-two-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content"><p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div>
</div>
</section>
</div>
"""
soup = BeautifulSoup(text, 'html.parser')
main_div = soup.find('div', class_='michelinKeyBenefitsComp')
for idx, div in enumerate(main_div.select('section > div.inner > div.col'), start=1):
    with open('file_' + str(idx) + '.txt', 'w', encoding='utf-8') as f:
        f.write(div.get_text())

# Output in a separate file per section: file_1.txt contains
#   Banana is yellow.
#   Yellow is my favorite color.
#   I love Banana.
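One detail worth noting: the empty `<p> </p>` elements come through `get_text()` as blank lines. A minimal sketch (reusing a shortened version of the question's HTML) that drops them with `stripped_strings` instead:

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question.
text = """
<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content"><div class="inner"><div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content"><p>Yellow is my favorite color.</p><p> </p><p>I love Banana.</p></div>
</div></div></section>
</div>
"""

soup = BeautifulSoup(text, 'html.parser')
for col in soup.select('div.michelinKeyBenefitsComp div.col'):
    # stripped_strings skips whitespace-only text nodes such as <p> </p>
    print('\n'.join(col.stripped_strings))
```

This prints the three non-blank lines of the first section with no blank line in between, however many paragraphs the section holds.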
Answer 1 (score: 0)
This should help.
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, "html.parser")
for sec in soup.find_all("section", {"id": re.compile("benefit-[a-z]+-content")}):
    # Create the filename from the section id and write only the non-blank lines.
    with open(sec["id"] + ".txt", "a") as outfile:
        outfile.write("\n".join(line for line in sec.text.strip().split("\n") if line.strip()) + "\n\n")
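If you would rather keep the sections in memory than write one file per section, the same regex match on the `id` can feed a dict keyed by section id. A sketch, using a shortened version of the question's HTML:

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question.
html = """
<section id="benefit-two-content"><div class="inner"><div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content">
<p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div></div></section>
"""

soup = BeautifulSoup(html, "html.parser")
sections = {}
for sec in soup.find_all("section", {"id": re.compile("benefit-[a-z]+-content")}):
    # Keep only the non-blank lines of the section text.
    sections[sec["id"]] = [line.strip() for line in sec.text.split("\n") if line.strip()]
print(sections)
```

Each value is a list of lines, so it works for any number of paragraphs per section.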