Question

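The main problem is getting the content of an HTML file with all tags removed. I have already read these questions:

1, 2, 3

After reading all of them, I decided to use jsoup, which helped. I also worked out how to keep line breaks and how to replace <p> tags with line breaks. My problem now is that I have an HTML file with an <H1> tag that holds the title of the whole content. I want to keep a line break after it, but with jsoup the first paragraph comes immediately after the title with no line break. Can anyone help me? The HTML I have: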

<DIV class="story-headline">
<H1 class="story-title">NFL 2014 predictions</H1>
</DIV>
<H3 class="story-deck">Our picks for playoff teams, surprises, Super Bowl</H3>
<P class="small lighttext">
<SPAN class="delimited">Posted: Sep 02, 2014 1:30 PM ET</SPAN>
<SPAN>Last Updated: Sep 04, 2014 10:27 AM ET</SPAN>
</P>

And the output is:

NFL 2014 predictionsOur picks for playoff teams, surprises, Super Bowl

Posted: Sep 02, 2014 1:30 PM ETLast Updated: Sep 04, 2014 10:27 AM ET

I want it to be:

NFL 2014 predictions  
Our picks for playoff teams, surprises, Super Bowl  
Posted: Sep 02, 2014 1:30 PM ET  
Last Updated: Sep 04, 2014 10:27 AM ET

Answer 1

You should hook into the OutputSettings of the target Document. Try the following:

    import java.io.File;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Whitelist; // renamed to Safelist in newer jsoup releases

    public class HtmlWithLineBreaks {

        public String getCleanHtml(Document document) {
            // Disable pretty-printing so that html() preserves line breaks and spacing
            document.outputSettings(new Document.OutputSettings().prettyPrint(false));
            return Jsoup.clean(document.html(), "", Whitelist.none(),
                    new Document.OutputSettings().prettyPrint(false));
        }

        public static void main(String... args) {
            // Replace the path with your own HTML file
            File input = new File("/path/to/some/input.html");
            try {
                Document document = Jsoup.parse(input, "UTF-8");
                String printOut = new HtmlWithLineBreaks().getCleanHtml(document);
                System.out.println(printOut);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

(Optional) If you are not satisfied with the resulting output, you can insert custom line breaks after the <div> that wraps the <h1>.
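The idea of adding a line break after chosen closing tags before stripping the markup can be tried out without jsoup. The following is only a minimal sketch and not the answer's original code: it uses plain java.util.regex instead of jsoup, assumes simple, well-formed markup like the snippet in the question, and does not decode HTML entities. The class name TagStripSketch, the method toTextWithBreaks, and the set of tags that trigger a break are all hypothetical choices made for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TagStripSketch {

    // Closing tags after which a line break is wanted (an assumption for this example).
    private static final Pattern BREAK_AFTER =
            Pattern.compile("</(h[1-6]|p|div|span)\\s*>", Pattern.CASE_INSENSITIVE);

    // Naive "any tag" matcher; regex is not a robust HTML parser.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    public static String toTextWithBreaks(String html) {
        // 1. Turn the source file's own line breaks into spaces.
        String flat = html.replaceAll("\\R", " ");
        // 2. Insert a newline after every closing tag of interest.
        String marked = BREAK_AFTER.matcher(flat).replaceAll("$0\n");
        // 3. Remove all tags, leaving text and the inserted newlines.
        String text = ANY_TAG.matcher(marked).replaceAll("");
        // 4. Trim each line and drop the empty ones.
        return Arrays.stream(text.split("\n"))
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String html = "<DIV class=\"story-headline\">\n"
                + "<H1 class=\"story-title\">NFL 2014 predictions</H1>\n"
                + "</DIV>\n"
                + "<H3 class=\"story-deck\">Our picks for playoff teams, surprises, Super Bowl</H3>";
        System.out.println(toTextWithBreaks(html));
    }
}
```

With the two elements used in main, the title and the deck should each end up on their own line. For real-world HTML, a parser-based approach such as the jsoup one is the safer choice.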
