Python / BeautifulSoup - How to extract the text between the <li> and <dl> tags

Time: 2013-09-09 11:43:08

Tags: python beautifulsoup html-parsing bs4

I have the following HTML code:

<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>

How can I extract only the text that sits between the opening <li> tag and the <dl> tag?

I tried this:

from bs4 import BeautifulSoup

s = """<ol>
    <li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
    <dl>
    <dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
    </dl>
    </li>
    </ol>
"""

soup = BeautifulSoup(s, 'html.parser')

for line in soup.find_all('ol'):
    print(line.li.get_text())

This prints:

If someone is able to do something, they can do it.

I'm busy today, so I won't be able to see you.

I only want the first line:

If someone is able to do something, they can do it.

1 answer:

Answer 0: (score: 4)

Iterate over the descendants of the line.li object, collect every NavigableString text object, and stop as soon as you reach the <dl> tag:

from bs4 import NavigableString

for line in soup.find_all('ol'):
    result = []
    for descendant in line.li.descendants:
        if isinstance(descendant, NavigableString):
            # Text node: keep it, without surrounding whitespace
            result.append(str(descendant).strip())
        elif descendant.name == 'dl':
            # First <dl> reached: ignore everything from here on
            break

    print(' '.join(result))

Demo:

>>> for line in soup.find_all('ol'):
...     result = []
...     for descendant in line.li.descendants:
...         if isinstance(descendant, NavigableString):
...             result.append(str(descendant).strip())
...         elif descendant.name == 'dl':
...             break
...     print(' '.join(result))
... 
If someone is able to do something, they can do it.

If you want to do this for all <li> tags (not just the first one), loop over the <li> tags found with .find_all() instead:

for line in soup.find_all('ol'):
    for item in line.find_all('li'):
        result = []
        for descendant in item.descendants:
            if isinstance(descendant, NavigableString):
                # Text node: keep it, without surrounding whitespace
                result.append(str(descendant).strip())
            elif descendant.name == 'dl':
                # First <dl> in this <li>: stop collecting
                break

        print(' '.join(result))
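
The same idea can be wrapped in a small reusable function. This is only a sketch, assuming Python 3 and the standard-library html.parser backend; the names text_before and stop_name are made up for illustration and are not part of BeautifulSoup. Unlike the loops above, it also skips strings that are empty after stripping, so that whitespace-only nodes do not add extra spaces to the result.

```python
from bs4 import BeautifulSoup, NavigableString


def text_before(tag, stop_name="dl"):
    """Collect the text inside `tag` up to its first `stop_name` descendant.

    `text_before` and `stop_name` are hypothetical names for this sketch.
    """
    parts = []
    for node in tag.descendants:
        if isinstance(node, NavigableString):
            text = str(node).strip()
            if text:  # skip whitespace-only strings
                parts.append(text)
        elif node.name == stop_name:
            break  # stop at the first matching tag; ignore what follows
    return " ".join(parts)


html = """<ol>
<li>If someone is <b>able</b> to do something, they <a href="/wiki/can" title="can">can</a> do it.
<dl>
<dd><i>I'm busy today, so I won't be <b>able</b> to see you.</i></dd>
</dl>
</li>
</ol>"""

soup = BeautifulSoup(html, "html.parser")

# One result per <li> in the document
results = [text_before(li) for li in soup.find_all("li")]
print(results)
```

Because each text node is stripped and then joined with a single space, punctuation that directly follows an inline tag (for example `<b>able</b>.`) would end up with a space in front of it; for input like that, the joining step would need adjusting.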