Converting (oddly) web-wrapped text to a plain text string

Time: 2014-06-22 22:51:03

Tags: python

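I am trying to convert some wrapped text into a flat text string, with line endings and everything. However, the wrapping is an odd kind I have never seen before. The text comes from a CDATA section of an XML file: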

<font color="#bfffffff" size="12"></font><font color="#ff00ff00" size="12">My fellow Muppets,<br><br>We are sorry to say that Devilish Intetions are not going to work out with The Muppet Brigade sorry guys you are just not active ebough I would how ever like to extend an arm to any players that would like to leave and join DynaCorp. If any of you are intrested just drop me a mail and best of luck in your future endevors. <br><br>o7 <br><br><br/></br></br></br></br></br></br></font><font color="#ff007fff" size="14">John Milbroc<br/></font><font color="#bfffffff" size="14">--------------------------<br/></font><font color="#ff007fff" size="14">The Muppet Brigade CEO</font>

I tried the following:

z = BeautifulSoup(string)
z.get_text()

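However, BeautifulSoup does not seem to do anything. I am very new to Python, so apologies if this is a very simple question.

I thought maybe my BeautifulSoup module was broken, because when I do this: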

from bs4 import BeautifulSoup

html_doc ="""
Hi.<br><br>This is a message.<br><br>
"""
print(html_doc)

soup = BeautifulSoup(html_doc)

print(soup.txt)

it prints:

Hi.<br><br>This is a message.<br><br>

None

After trying that, I experimented with a few other things and found that if you do

soup.get_text()

instead of

soup.txt

it prints the parsed text. Very odd, but it worked. Thanks for the nudge in the right direction.
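For reference, here is a minimal sketch of why the two behave differently. It assumes bs4 is installed and passes the standard-library "html.parser" backend explicitly (that argument is an addition, not part of the snippets above). BeautifulSoup treats an unknown attribute such as soup.txt as a search for a <txt> tag, so it returns None when the document has no such tag, whereas soup.get_text() returns the text with the markup removed.

```python
from bs4 import BeautifulSoup

html_doc = """
Hi.<br><br>This is a message.<br><br>
"""

# "html.parser" is the parser bundled with Python; choosing it is an assumption.
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access is shorthand for soup.find("txt"); there is no <txt> tag here.
missing = soup.txt

# get_text() concatenates every text node and drops the tags.
text = soup.get_text()

print(missing)
print(text)
```

Note that get_text() drops the <br> tags without inserting anything in their place, so the two sentences end up directly next to each other.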

2 answers:

Answer 0: (score: 0)

Remove the <br>, <br/> and </br> tags from the text and the wrapping should disappear. These are line breaks in HTML.
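A rough sketch of that idea using only the standard library. The function name strip_wrapping and the br_replacement parameter are hypothetical, made up for this example. It assumes simple markup like the sample in the question; a regular expression is not a real HTML parser and will not cope with arbitrary pages.

```python
import re
from html import unescape

# Matches <br>, <br/>, <br /> and the stray </br> closing form.
BR_TAG = re.compile(r"<\s*/?\s*br\s*/?\s*>", re.IGNORECASE)
# Matches any other tag, e.g. <font ...> and </font>.
ANY_TAG = re.compile(r"<[^>]+>")


def strip_wrapping(html, br_replacement="\n"):
    """Replace br tags, drop all other tags, and decode HTML entities."""
    text = BR_TAG.sub(br_replacement, html)
    text = ANY_TAG.sub("", text)
    return unescape(text)


sample = '<font size="12">Hi.<br><br>Bye<br/></br></font>'

# Keep the line breaks by turning each br tag into a newline.
print(strip_wrapping(sample))

# Or remove them entirely, as suggested above.
print(strip_wrapping(sample, br_replacement=""))
```

Replacing the br tags with a newline instead of deleting them keeps the paragraph breaks of the original message.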

Answer 1: (score: 0)

Why not parse the HTML with BeautifulSoup? For example:

html_doc = """
 ## paste your HTML text here
"""

Then you parse it:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

Then you extract the text:

print(soup.text)


My fellow Muppets,We are sorry to say that Devilish Intetions are not 
going to work out with The Muppet Brigade sorry guys you are just not active ebough 
I would how ever like to extend an arm to any players that would like to leave and join DynaCorp. 
If any of you are intrested just drop me a mail and best of luck in your future endevors. 
o7 
John Milbroc--------------------------

The Muppet Brigade CEO
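In the output above the <br> tags are dropped, so "My fellow Muppets," runs straight into the next sentence. If the line breaks matter, one possible approach (a sketch, not the only way) is to swap each <br> for a newline before extracting the text. The helper name html_to_text and the shortened sample string are made up for this example, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the text of an HTML fragment, keeping <br> as newlines."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    return soup.get_text()


# Shortened, hypothetical stand-in for the message in the question.
sample = '<font size="12">Hello,<br><br>World<br/></font><font size="14">Bye</font>'

text = html_to_text(sample)
print(text)
```

Text from adjacent <font> elements is still joined without a separator, so a break only appears where the markup had a <br>.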