How do I remove all <...> tags from an HTML file except for a specific tag?

Time: 2016-03-21 00:15:36

Tags: java html string parsing

I have the following HTML file. I want to remove all tags except ones like this:



<A href="MarineMammal.html">marine mammals.</A>

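I can remove all the tags, but I cannot figure out how to keep a specific tag. I would like to be able to get the words surrounding tags like the one above. Those words should not contain any tags. Thanks!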

<TITLE> Whale </TITLE>
<H2> Whale </H2>
(from Wikipedia)

<p>
Whale is the common name for a widely distributed and diverse group of 
fully aquatic placental 
<A href="MarineMammal.html">marine mammals.</A>. They are an informal grouping 
within the infraorder <A href="Cetacean.html">Cetacea,</A> usually excluding 
<A href="Dolphin.html">dolphins</A> and 
<A href="Porpoise.html">porpoises.</A> 
Whales, dolphins and porpoises belong to the order Cetartiodactyla with 
even-toed 
<A href="Ungulate.html">ungulates</A> and their 
closest living relatives are the 
<A href="Hippopotamus.html">hippopotamuses,</A> having 
diverged about 40 million years ago. 

1 answer:

Answer 0 (score: 1)

There are many ways to do this; they are discussed in this question.

The simplest is probably this one:

replaceAll("\\<[^>]*>","")
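Note that the one-liner above strips every tag, including the `<A>` tags the question wants to keep. One possible way to keep them is to add a negative lookahead so that the pattern skips opening and closing `A` tags. The following is a minimal sketch under that assumption, not a definitive solution; the class name `KeepAnchorTags` and the method name `stripAllButAnchors` are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

public class KeepAnchorTags {
    // Matches any tag whose name is NOT "a" (opening or closing form).
    // (?!/?a\b) rejects "<a ...>" and "</a>", while "\b" ensures that
    // tags such as <abbr> are still matched and removed.
    private static final Pattern NON_ANCHOR_TAG =
            Pattern.compile("<(?!/?a\\b)[^>]*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes every tag except <A ...> and </A>.
    public static String stripAllButAnchors(String html) {
        return NON_ANCHOR_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // A short excerpt of the sample file from the question.
        String html = "<H2> Whale </H2>\n"
                + "<p>\n"
                + "fully aquatic placental \n"
                + "<A href=\"MarineMammal.html\">marine mammals.</A>. They are";
        System.out.println(stripAllButAnchors(html));
    }
}
```

Like any regex-based approach, this is brittle: it assumes that no attribute value contains a `>` character and that the markup has no comments or scripts that need special handling. For input that is not this simple, a real HTML parser would be the more robust choice.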