In Python BeautifulSoup4, how do I extract special text like this?

Time: 2014-05-06 22:03:51

Tags: python beautifulsoup

I am trying to extract part of a string from this markup:

    text = """<li>(<a rel="nofollow" class="external text" href="http://www.icd9data.com/getICD9Code.ashx?
    icd9=999.1">999.1</a>) <a href="/wiki/Air_embolism" title="Air embolism">Air embolism</a> as
    a complication of medical care not elsewhere classified</li>"""

My target is "as a complication of medical care not elsewhere classified", but this code does not work:

    import bs4

    soup = bs4.BeautifulSoup(text)
    for tag in soup.find_all('li'):
        print(tag.string)

Does anyone know a way to get the string I want? Thanks.

1 answer:

Answer 0 (score: 1)

    for tag in soup.find_all('li'):
        print(tag.get_text())

prints

    (999.1) Air embolism as
    a complication of medical care not elsewhere classified

The get_text method returns all the text inside a tag, including text that belongs to child tags.
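
If only the trailing text is wanted rather than everything inside the `<li>`, one possible approach with BeautifulSoup alone is to look at the text nodes that are direct children of the `<li>`. The following is only a sketch under a few assumptions: it uses the built-in `html.parser`, the `href` is joined onto one line for readability, and the names `html`, `direct_text` and `trailing` are illustrative rather than part of the original question.

```python
from bs4 import BeautifulSoup

# Same markup as in the question (href joined onto one line for readability).
html = (
    '<li>(<a rel="nofollow" class="external text" '
    'href="http://www.icd9data.com/getICD9Code.ashx?icd9=999.1">999.1</a>) '
    '<a href="/wiki/Air_embolism" title="Air embolism">Air embolism</a> as\n'
    'a complication of medical care not elsewhere classified</li>'
)

soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# .string is None here because <li> has more than one child node.
print(li.string)

# Text nodes that are direct children of <li>; recursive=False skips
# the text inside the two <a> tags.
direct_text = li.find_all(string=True, recursive=False)

# The last direct text node follows the second <a>. split()/join collapses
# the embedded newline and the leading space.
trailing = " ".join(direct_text[-1].split())
print(trailing)
```

This also shows why `tag.string` did not work in the question: it only returns a value when the tag has a single child.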


With lxml, you could use

    import lxml.html as LH
    text = """<li>(<a rel="nofollow" class="external text" href="http://www.icd9data.com/getICD9Code.ashx?
    icd9=999.1">999.1</a>) <a href="/wiki/Air_embolism" title="Air embolism">Air embolism</a> as
    a complication of medical care not elsewhere classified</li>"""

    doc = LH.fromstring(text)
    for tag in doc.xpath('//li/a[2]'):
        print(tag.tail)

to get

     as
    a complication of medical care not elsewhere classified
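
The `tail` value above still contains the leading space and the line break from the source markup. A minimal sketch of two ways to clean that up, assuming the same markup as in the question (with the `href` joined onto one line); the names `html`, `cleaned` and `via_xpath` are illustrative only:

```python
import lxml.html as LH

# Same markup as in the question (href joined onto one line for readability).
html = (
    '<li>(<a rel="nofollow" class="external text" '
    'href="http://www.icd9data.com/getICD9Code.ashx?icd9=999.1">999.1</a>) '
    '<a href="/wiki/Air_embolism" title="Air embolism">Air embolism</a> as\n'
    'a complication of medical care not elsewhere classified</li>'
)

doc = LH.fromstring(html)

# Option 1: take the tail of the second <a> and collapse whitespace in Python.
second_link = doc.xpath('//li/a[2]')[0]
cleaned = " ".join(second_link.tail.split())
print(cleaned)

# Option 2: let XPath do it. normalize-space() trims the ends and collapses
# runs of whitespace in the first text node that follows the second <a>.
via_xpath = doc.xpath('normalize-space(//li/a[2]/following-sibling::text()[1])')
print(via_xpath)
```

Both options should give the same single-line string. Which one to prefer is mostly a matter of taste.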