>>> BeautifulSoup('<span>this is a</span>cat').text
u'this is acat'
>>> BeautifulSoup('Spelled f<b>o</b>etus in British English with extra "o"').text
u'Spelled foetus in British English with extra "o"'
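Some of the text between tags should get a space when it is extracted (as in that `acat`). Is there a way to make sure the parser puts a space wherever it makes sense? I am trying to convert emails to text.
Answer 0 (score: 2)
Never mind, I was wrong: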
from bs4 import BeautifulSoup

def grab(soup):
    return ' '.join(unicode(i.string) for i in soup.body.contents)
# soup.body.contents contains a list of all the tags
# [<span>this is a</span>, u'cat']
# [<p>Spelled f<b>o</b>etus in British English with extra "o"</p>]
# i.string returns the text of a tag, similar to .text, but it returns None
# when the tag itself contains other tags.
# unicode() converts each bs4 object to a plain string so that ' '.join() can be called on it.
# It is better to use unicode() than str() here. From the documentation:
## If you want to use a NavigableString outside of Beautiful Soup,
## you should call unicode() on it to turn it into a normal
## Python Unicode string. If you don’t, your string will carry around
## a reference to the entire Beautiful Soup parse tree, even when
## you’re done using Beautiful Soup. This is a big waste of memory.
# Lastly, as .contents returns a list, we join it together.
soup1 = BeautifulSoup('<span>this is a</span>cat')
soup2 = BeautifulSoup('Spelled f<b>o</b>etus in British English with extra "o"')
soups = [soup1, soup2] # here we have a list of the soups
for i in soups:
    result = grab(i)  # Either u'None', or the correct string with a space
    if result == 'None':  # The text had a nested tag (i.e., like your second example)
        print i.text
    else:
        print result  # The result with a space.
Prints:
this is a cat
Spelled foetus in British English with extra "o"
Answer 1 (score: 0)
Edited based on the comments:
BeautifulSoup supports the first example. All you have to do is:
BeautifulSoup('<span>this is a</span>cat').get_text(" ")
It joins the text of the elements with a space between them. This is covered in the Beautiful Soup documentation for get_text().
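
For reference, here is a minimal Python 3 sketch of how the separator argument behaves on both inputs from the question. The explicit 'html.parser' argument and the variable names are my own additions for illustration, not part of the answer above. Note that a blanket separator also splits around the inline <b> tag, so it helps with the first example but not the second:

```python
from bs4 import BeautifulSoup

# Both inputs from the question, parsed with the stdlib-backed parser.
first = BeautifulSoup('<span>this is a</span>cat', 'html.parser')
second = BeautifulSoup(
    'Spelled f<b>o</b>etus in British English with extra "o"', 'html.parser'
)

# get_text(separator) joins every text node with the given separator.
spaced_first = first.get_text(' ')
spaced_second = second.get_text(' ')

# Without a separator, the text nodes are simply concatenated.
plain_second = second.get_text()

print(spaced_first)   # this is a cat
print(spaced_second)  # Spelled f o etus in British English with extra "o"
print(plain_second)   # Spelled foetus in British English with extra "o"
```

So get_text(" ") is probably fine when the markup is mostly block-level, but for inline formatting such as <b> inside a word you would still want the plain .text, much like the fallback in the other answer.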