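I'm working with HTML elements that contain child tags, and I want to "ignore" or strip those tags so that only the text remains. Right now, if I try .string on any element that contains a tag, all I get is None.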
import bs4
soup = bs4.BeautifulSoup("""
<div id="main">
<p>This is a paragraph.</p>
<p>This is a paragraph <span class="test">with a tag</span>.</p>
<p>This is another paragraph.</p>
</div>
""")
main = soup.find(id='main')
for child in main.children:
    print(child.string)
Output:
This is a paragraph.
None
This is another paragraph.
I'd like the second line to be This is a paragraph with a tag. How can I do this?
Answer 0 (score: 4)
for child in soup.find(id='main'):
    # Only Tag children have .text; skip the whitespace strings between tags
    if isinstance(child, bs4.Tag):
        print(child.text)
And you will get:
This is a paragraph.
This is a paragraph with a tag.
This is another paragraph.
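A self-contained sketch of the same approach, assuming Python 3 and the built-in html.parser. Here get_text() is the method form of .text, and the helper name paragraph_texts is made up for illustration:

```python
import bs4

HTML = """
<div id="main">
<p>This is a paragraph.</p>
<p>This is a paragraph <span class="test">with a tag</span>.</p>
<p>This is another paragraph.</p>
</div>
"""

def paragraph_texts(html):
    """Return the full text of each child tag of #main (hypothetical helper)."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    main = soup.find(id="main")
    # Keep only Tag children; the newlines between <p> tags are plain strings.
    return [child.get_text() for child in main.children
            if isinstance(child, bs4.Tag)]

for text in paragraph_texts(HTML):
    print(text)
```

Unlike .string, get_text() concatenates every string nested inside the tag, so the paragraph containing the span comes back whole.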
Answer 1 (score: 0)
Use the .strings iterable instead. Use ''.join() to pull in all the strings and join them together:
print(''.join(main.strings))
Iterating over .strings yields every contained string, whether it sits directly in the element or inside a child tag.
Demo:
>>> print(''.join(main.strings))
This is a paragraph.
This is a paragraph with a tag.
This is another paragraph.
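If one line per paragraph is wanted rather than a single joined string, the same join can be applied to each child tag separately. A minimal sketch, assuming Python 3 and the built-in html.parser; join_strings is a hypothetical helper name:

```python
import bs4

soup = bs4.BeautifulSoup(
    '<div id="main">'
    '<p>This is a paragraph.</p>'
    '<p>This is a paragraph <span class="test">with a tag</span>.</p>'
    '</div>',
    "html.parser",
)
main = soup.find(id="main")

def join_strings(tag):
    """Concatenate every string inside tag, including nested tags."""
    return "".join(tag.strings)

# One joined string per <p>, so nested tags no longer produce None.
lines = [join_strings(p) for p in main.find_all("p")]
for line in lines:
    print(line)
```

The second paragraph still has .string equal to None, because it has more than one child node, while joining its .strings recovers the full sentence.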