What is the best way to remove HTML tags from an HTML page?

Asked: 2013-11-18 04:38:37

Tags: java html html-parsing

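What is the best way to remove the HTML tags from an HTML page? I only want the actual text, not the tags, and I want to store that text in a string with no HTML markup in it. What is the simplest way to do this? A sample page looks like this: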

<HTML><HEAD>
<META NAME="Docdate" CONTENT="05/02/2011">
<META NAME="m_title" CONTENT="TWO SECURITY GUARDS HACKED TO DEATH DURING A FIGHT">
<META NAME="m_author" CONTENT="">
<TITLE>MALAYSIA NEWS -- GENERAL NEWS -- 05/02/2011 -- TWO SECURITY GUARDS HACKED TO DEATH DURING A FIGHT</TITLE>
</HEAD><BODY BACKGROUND="#FFFFFF">
<PRE>
05/02/2011

POLICE-FIGHT

TWO SECURITY GUARDS HACKED TO DEATH DURING A FIGHT





    KUALA LUMPUR, Feb 5 (Bernama) -- Two security guards were hacked to death in

a fight that broke out at Damansara Perdana construction site last night. 

    Both men, aged 20 and 26, were found dead at the scene with slash wounds on

their bodies in the 8.20pm incident. 

    Petaling Jaya OCPD ACP Arjunaidi Mohammed said the fight started following

an argument involving a security guard and several foreign workers at the site. 

    "One of them had an argument with several of the workers. He then called two

of his friends who are also security guards but working in other areas. 

    "A group of 12 to 15 foreign workers, carrying sharp weapons, then attacked

them," he told reporters at the scene today. 

    The other security guard managed to flee to safety, he added. 

    "The foreign workers had also left the area. We have picked up a security

guard in the area and two Indonesian workers to have their statements taken," he

said, adding that a manhunt was underway for the suspects. 

    -- BERNAMA 

    NMR AKT JS





</PRE>
<BODY></HTML>

1 answer:

Answer 0 (score: 0)

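Use a library such as jsoup to parse the HTML, then walk the resulting DOM and output the text nodes. Real HTML "in the wild" is often badly malformed (the sample above, for instance, ends with a second <BODY> instead of </BODY>), so trying to do this reliably yourself would be a huge waste of effort. It is better to use a library that was designed to cope with malformed HTML.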
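
With jsoup on the classpath, this usually comes down to something like Jsoup.parse(html).text(). As a dependency-free illustration of the same idea (let a tolerant parser do the work and collect only the text nodes), here is a minimal sketch that swaps in the HTML parser shipped with the JDK (javax.swing.text.html) instead of jsoup. The class name HtmlTextExtractor and the method extractText are hypothetical names chosen for this example, and the sketch assumes the page is already held in a String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of jsoup or the JDK.
final class HtmlTextExtractor {

    // Parses (possibly malformed) HTML and returns only its text content.
    public static String extractText(String html) {
        StringBuilder text = new StringBuilder();

        // The parser calls handleText for each run of text it finds between tags;
        // tags, attributes and comments are reported through other callbacks
        // that this sketch simply does not override.
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };

        try {
            // Third argument: ignore any charset declared in the document,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            // Reading from a StringReader should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unclosed <P>, missing </BODY>.
        String html = "<HTML><HEAD><TITLE>News</TITLE></HEAD>"
                + "<BODY><P>Two guards <B>were</B> attacked.</HTML>";
        System.out.println(extractText(html));
    }
}
```

Note that this sketch puts each text run on its own line and does not try to preserve the original spacing inside <PRE>; if exact whitespace matters, check how the parser you choose treats preformatted text.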