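The main problem is getting the content of an HTML file and removing all the tags. I had already read the earlier questions on this topic.
After reading all of them, I decided to use jsoup, which helped. I also worked out how to keep line breaks and replace <p> tags with newlines.
Now my problem is that I have an HTML file with an <H1> tag that holds the title of the whole content. I want to keep a line break after it, but with jsoup the first paragraph follows the title directly, with no line break in between. Can anyone help me?
The HTML I have: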
<DIV class="story-headline">
<H1 class="story-title">NFL 2014 predictions</H1>
</DIV>
<H3 class="story-deck">Our picks for playoff teams, surprises, Super Bowl</H3>
<P class="small lighttext">
<SPAN class="delimited">Posted: Sep 02, 2014 1:30 PM ET</SPAN>
<SPAN>Last Updated: Sep 04, 2014 10:27 AM ET</SPAN>
</P>
The output is:
NFL 2014 predictionsOur picks for playoff teams, surprises, Super Bowl
Posted: Sep 02, 2014 1:30 PM ETLast Updated: Sep 04, 2014 10:27 AM ET
I would like it to be:
NFL 2014 predictions
Our picks for playoff teams, surprises, Super Bowl
Posted: Sep 02, 2014 1:30 PM ET
Last Updated: Sep 04, 2014 10:27 AM ET
Answer 0 (score: 1)
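You should hook into the OutputSettings of the target Document. Try the following:
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

public class HtmlWithLineBreaks
{
    public String getCleanHtml(Document document)
    {
        // With prettyPrint(false), html() preserves the source's line breaks and spacing
        document.outputSettings(new Document.OutputSettings().prettyPrint(false));
        // Whitelist.none() strips every tag and keeps only the text
        // (jsoup 1.14.2 and later rename Whitelist to Safelist)
        return Jsoup.clean(document.html(),
                "",
                Whitelist.none(),
                new Document.OutputSettings().prettyPrint(false));
    }

    public static void main(String... args)
    {
        // Just replace the input with your own HTML file source
        File input = new File("/path/to/some/input.html");
        try
        {
            Document document = Jsoup.parse(input, "UTF-8");
            String printOut = new HtmlWithLineBreaks().getCleanHtml(document);
            System.out.println(printOut);
        } catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
(Optional) If you are not satisfied with the output this produces, you can insert a custom line break after the <div> that wraps the <h1>.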
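The snippet that originally accompanied this optional step is not preserved here, so the following is only a minimal sketch of one way it might be done, not the answerer's own code. It assumes the headline wrapper can be selected with div.story-headline (taken from the HTML in the question) and that a recent jsoup is in use, where Whitelist is called Safelist. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.safety.Safelist;

// Hypothetical helper: names are illustrative, not part of jsoup or the original answer.
class HeadlineLineBreak
{
    static String cleanWithHeadlineBreak(String html)
    {
        Document document = Jsoup.parse(html);
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        document.outputSettings(settings);

        // Assumption: the <div> wrapping the <h1> carries the class "story-headline".
        for (Element headline : document.select("div.story-headline"))
        {
            // Add a newline text node as the next sibling of the wrapping <div>
            headline.after(new TextNode("\n"));
        }

        // Strip all tags; prettyPrint(false) keeps the inserted newline untouched
        return Jsoup.clean(document.body().html(), "", Safelist.none(), settings);
    }

    public static void main(String... args)
    {
        String html = "<div class=\"story-headline\"><h1 class=\"story-title\">NFL 2014 predictions</h1></div>"
                + "<h3 class=\"story-deck\">Our picks for playoff teams, surprises, Super Bowl</h3>";
        System.out.println(cleanWithHeadlineBreak(html));
    }
}
```

The same loop could target other block-level elements (for example the <h3> or the <span> tags) by changing the selector. On jsoup versions older than 1.14.2, Safelist.none() would need to be Whitelist.none(), as in the answer's code.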