BeautifulSoup joins words together

Time: 2013-05-27 06:22:46

Tags: python beautifulsoup

>>> BeautifulSoup('<span>this is a</span>cat').text
u'this is acat'
>>> BeautifulSoup('Spelled f<b>o</b>etus in British English with extra "o"').text
u'Spelled foetus in British English with extra "o"'

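Some text that sits on either side of a tag boundary should be separated by a space when parsed (as in the "acat" case above). Is there a way to make sure the parser puts a space wherever it makes sense? I'm trying to convert emails to text.

2 answers:

Answer 0 (score: 2)

Never mind, I had it wrong: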

from bs4 import BeautifulSoup

def grab(soup):
    return ' '.join(unicode(i.string) for i in soup.body.contents)
           # soup.body.contents contains a list of all the tags
           # [<span>this is a</span>, u'cat']
           # [<p>Spelled f<b>o</b>etus in British English with extra "o"</p>]

           # i.string gets the text of a tag, similar to .text, but if there are tags in the tag you want to get the .string of, it will return None.

           # unicode() converts each item from a bs4 type to a plain string, so that ' '.join() can be called on them.
           # It's good to use unicode() instead of str():
           ## If you want to use a NavigableString outside of Beautiful Soup, 
           ## you should call unicode() on it to turn it into a normal 
           ## Python Unicode string. If you don’t, your string will carry around 
           ## a reference to the entire Beautiful Soup parse tree, even when 
           ## you’re done using Beautiful Soup. This is a big waste of memory.

           # Lastly, as .contents returns a list, we join it together.

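# The 'lxml' parser is named explicitly: it wraps the markup in <html><body>, which grab() relies on.
soup1 = BeautifulSoup('<span>this is a</span>cat', 'lxml')
soup2 = BeautifulSoup('Spelled f<b>o</b>etus in British English with extra "o"', 'lxml')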
soups = [soup1, soup2] # here we have a list of the soups
for i in soups:
    result = grab(i) # It will be either u'None', or the correct string with a space
    if result == 'None': # If the result had a tag in between (i.e., like your second example)
        print i.text
    else:
        print result # The result with a space.

Prints:

this is a cat
Spelled foetus in British English with extra "o"

Answer 1 (score: 0)

Edited based on the comments:

BeautifulSoup supports the first example. All you have to do is

BeautifulSoup('<span>this is a</span>cat').get_text(" ")

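It joins the text of the elements with a space between them. This is documented under get_text() in the Beautiful Soup documentation.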
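
To illustrate the trade-off, here is a minimal sketch that applies a space separator to both strings from the question. It assumes Python 3 with the bs4 package installed. The helper name to_text is made up for this illustration, and "html.parser" is just one possible parser choice. Note that the separator goes between every text node, so the second example ends up with spaces inside "foetus" as well.

```python
from bs4 import BeautifulSoup


def to_text(markup):
    # to_text is a hypothetical helper, not part of Beautiful Soup.
    # "html.parser" (the standard-library parser) is named explicitly so the
    # result does not depend on which third-party parsers are installed.
    soup = BeautifulSoup(markup, "html.parser")
    # separator=" " joins the text nodes with a space; strip=True trims each
    # node and drops the ones that are empty after trimming.
    return soup.get_text(" ", strip=True)


first = to_text('<span>this is a</span>cat')
second = to_text('Spelled f<b>o</b>etus in British English with extra "o"')

print(first)   # this is a cat
print(second)  # Spelled f o etus in British English with extra "o"
```

So a blanket separator fixes the first case but breaks up words that are split by inline tags. Deciding where a space "makes sense" would need extra logic about which tags act as word boundaries, which this sketch does not attempt.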