I have the following HTML code:
<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>
How can I extract the text between the <li> and <dl> tags?
I tried this:
from bs4 import BeautifulSoup
s = """<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>
"""
soup = BeautifulSoup(s)
for line in soup.find_all('ol'):
    print line.li.get_text()
This prints:
If someone is able to do something, they can do it.
I'm busy today, so I won't be able to see you.
I only want the first line:
If someone is able to do something, they can do it.
Answer 0 (score: 4)
Loop over the descendants of the line.li object, collect all NavigableString text objects, and stop when you encounter the <dl> tag:
from bs4 import NavigableString
for line in soup.find_all('ol'):
    result = []
    for descendant in line.li.descendants:
        if isinstance(descendant, NavigableString):
            result.append(unicode(descendant).strip())
        elif descendant.name == 'dl':
            break
    print u' '.join(result)
Demo:
>>> for line in soup.find_all('ol'):
...     result = []
...     for descendant in line.li.descendants:
...         if isinstance(descendant, NavigableString):
...             result.append(unicode(descendant).strip())
...         elif descendant.name == 'dl':
...             break
...     print u' '.join(result)
...
If someone is able to do something, they can do it.
If you want to do this for all <li> tags (not just the first one), you need to loop over the <li> tags found with .find_all() instead:
for line in soup.find_all('ol'):
    for item in line.find_all('li'):
        result = []
        for descendant in item.descendants:
            if isinstance(descendant, NavigableString):
                result.append(unicode(descendant).strip())
            elif descendant.name == 'dl':
                break
        print u' '.join(result)
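The snippets above use Python 2 syntax (the print statement and unicode()). As a rough sketch of the same descendant-walking idea on Python 3, the loop could be wrapped in a small helper. The names text_before_tag and stop_name are hypothetical and do not come from the original answer, and passing "html.parser" explicitly is an assumption made here so the example does not depend on an optional parser being installed.

```python
from bs4 import BeautifulSoup, NavigableString


def text_before_tag(element, stop_name="dl"):
    """Join the text inside `element` that precedes the first `stop_name` tag.

    Hypothetical helper (not part of BeautifulSoup): it walks the descendants
    in document order, keeps the text nodes, and stops at the first matching tag.
    """
    parts = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            text = str(descendant).strip()
            if text:  # skip whitespace-only strings
                parts.append(text)
        elif descendant.name == stop_name:
            break
    return " ".join(parts)


html = """<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>
"""

# Assumption: the built-in "html.parser" backend is good enough for this markup.
soup = BeautifulSoup(html, "html.parser")
first_lines = [text_before_tag(li) for li in soup.find_all("li")]
print(first_lines)
```

Because every fragment is stripped and then joined with a single space, this approach would put a space before punctuation that directly follows an inline tag (for example `<b>able</b>,`). That does not occur in the sample markup, but it may matter for other input.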