Question

I am using Python's Beautiful Soup to get the contents of:

<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>

My code is as follows:

html_doc="""<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)

print breadcrum

The output is as follows:

[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']

How can I get the result in this form, as a single string: abc,def,ghi?

Also, I would like to understand the output I am getting.

Answer 1

If you just strip the items in breadcrum, you end up with empty strings in the list. You can either do as shaktimaan suggested, or remove the empty items with

breadcrum = filter(None, breadcrum)

Or you can replace the newlines beforehand (in html_doc):

html_doc = html_doc.replace('\n', ' ').replace('\r', '')

Either way, to get the string output, do something like this:

','.join(breadcrum)
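
Putting the steps above together, here is a minimal sketch. It assumes Python 3 syntax (the question uses Python 2), and the list literal is only a stand-in for what `path.findAll(text=True)` returned in the question; the names `stripped`, `cleaned`, and `result` are illustrative.

```python
# Stand-in for the list returned by path.findAll(text=True) in the question.
breadcrum = ['\n', 'abc', '\n', 'def', '\n', 'ghi', '\n']

# Strip whitespace from every item; the '\n' entries become empty strings.
stripped = [item.strip() for item in breadcrum]

# filter(None, ...) drops falsy items, which here are the empty strings.
# In Python 3, filter() returns an iterator, so wrap it in list().
cleaned = list(filter(None, stripped))

# Join the remaining items into a single comma-separated string.
result = ','.join(cleaned)
print(result)  # abc,def,ghi
```

Note that joining without the strip-and-filter step would keep the newline entries, so the result would contain stray '\n' characters between the commas.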

Answer 2

Unless I'm missing something, you just need to combine strip with a list comprehension.

Code:

from bs4 import BeautifulSoup as bsoup

ofile = open("test.html", "r")
soup = bsoup(ofile)

res = ",".join([a.get_text().strip() for a in soup.find("div", class_="path").find_all("a")])
print res

Result:

abc,def,ghi
[Finished in 0.2s]
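
As for why the original output contains the '\n' entries: they are the whitespace text nodes that sit between the tags inside the div, and `findAll(text=True)` returns every text node, including those. As a possible alternative to the list comprehension, Beautiful Soup can skip whitespace-only nodes itself. The sketch below assumes Python 3 and the built-in "html.parser"; it reuses the question's `html_doc` inline instead of reading test.html, and the names `parts` and `joined` are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", class_="path")

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that contain only whitespace.
parts = list(path.stripped_strings)

# get_text() with a separator and strip=True strips and joins in one call.
joined = path.get_text(",", strip=True)
print(joined)
```

Either `",".join(parts)` or `joined` gives the single string the question asks for.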

Removing new line '\n' from the output of python BeautifulSoup

2 answers: