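I have a problem: I want to use Java to split a single HTML file into multiple HTML files. The file contains several textbook chapters, and I want each chapter in its own HTML file. The start of each chapter can be identified by an h2 tag with an id. Below is a sample of the HTML file I want to split.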
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 7 December 2008), see www.w3.org"/>
<title>Sample HTML</title>
<link rel="stylesheet" href="0.css" type="text/css"/>
<link rel="stylesheet" href="1.css" type="text/css"/>
<link rel="stylesheet" href="sample.css" type="text/css"/>
<meta name="generator" content="sample content"/>
</head>
<body><div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00007">Chapter 7</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0008"><!-- H2 anchor --></a></p>
<div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00008">Chapter 8</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0009"><!-- H2 anchor --></a></p>
<div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00009">Chapter 9</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0010"><!-- H2 anchor --></a></p>
<div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00010">Chapter 10</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0011"><!-- H2 anchor --></a></p>
</body></html>
Answer 0 (score: 1)
I'm not entirely sure this will work, but I'd guess you could use a parser such as http://jsoup.org/ along these lines:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements chapters = doc.select("h2");
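You then have to extract the content that belongs to each of those elements and save it as a new HTML file (including the body and so on).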
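The extraction step can be sketched roughly as follows. To keep the sketch runnable without extra dependencies, it swaps jsoup for the JDK's built-in XML parser, which is only an option because the sample is well-formed XHTML; input with named entities such as &nbsp; or with unclosed tags would need jsoup or similar. The class name `ChapterGrouper` and the method `groupByH2` are made up for illustration, and the method only groups the text of each chapter by the id of its h2 rather than writing files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ChapterGrouper {

    /**
     * Hypothetical helper: maps each h2 id to the text of the body-level
     * elements that follow that h2, up to (not including) the next h2.
     */
    public static Map<String, List<String>> groupByH2(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: the JDK's default (Xerces-based) parser is in use, so this
            // feature stops it from downloading the XHTML DTD named in the DOCTYPE.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            Node body = doc.getElementsByTagName("body").item(0);
            Map<String, List<String>> chapters = new LinkedHashMap<>();
            List<String> current = null; // stays null until the first h2 is seen
            for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes and comments
                }
                Element e = (Element) n;
                if ("h2".equals(e.getTagName())) {
                    // A new chapter starts here; later elements are collected into it.
                    current = new ArrayList<>();
                    chapters.put(e.getAttribute("id"), current);
                } else if (current != null) {
                    current.add(e.getTextContent());
                }
            }
            return chapters;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", ex);
        }
    }
}
```

Anything before the first h2 (the spacer div in the sample) is dropped. Each map entry would then be wrapped in its own html/head/body skeleton and written to a separate file.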
Answer 1 (score: 0)
In the end, this is the solution I arrived at for splitting the HTML as described in my question:
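import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {

    public static void JsoupReader() {
        File input = new File("src/resources/sample_book.htm.html");
        try {
            Document doc = Jsoup.parse(input, "UTF-8");
            Element head = doc.select("head").first();
            Element firstH2 = doc.select("h2").first();
            if (firstH2 == null) {
                return; // no chapter headings, nothing to split
            }
            // All siblings of the first h2 (the h2 itself is not included).
            Elements siblings = firstH2.siblingElements();
            // text() rather than html(), so nested markup never ends up in the file name.
            String h2Text = firstH2.text();
            List<Element> elementsBetween = new ArrayList<Element>();
            // Start at 1: in the sample, index 0 is the spacer div before the first h2.
            for (int i = 1; i < siblings.size(); i++) {
                Element sibling = siblings.get(i);
                if (!"h2".equals(sibling.tagName())) {
                    elementsBetween.add(sibling);
                } else {
                    // The next h2 closes the current chapter: write it out and start over.
                    processElementsBetween(h2Text, head, elementsBetween);
                    elementsBetween.clear();
                    h2Text = sibling.text();
                }
            }
            // Write the last chapter, which has no following h2 to trigger the write.
            if (!elementsBetween.isEmpty()) {
                processElementsBetween(h2Text, head, elementsBetween);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Writes one chapter to src/resources/<h2 text>.html, reusing the original head.
    private static void processElementsBetween(String h2Text, Element head,
            List<Element> elementsBetween) throws IOException {
        File newHtmlFile = new File("src/resources/" + h2Text + ".html");
        StringBuilder htmlString = new StringBuilder();
        htmlString.append("<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">");
        htmlString.append(head.outerHtml());
        htmlString.append("<body>"
                + "<div class=\"c2\">"
                + "<br/>"
                + "<br/>"
                + "<br/>"
                + "<br/>"
                + "</div>");
        System.out.println("---"); // progress marker, one per chapter written
        for (Element element : elementsBetween) {
            htmlString.append(element.outerHtml());
        }
        htmlString.append("</body></html>");
        FileUtils.writeStringToFile(newHtmlFile, htmlString.toString(), "UTF-8");
    }
}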
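One thing to watch in the code above is that the h2 text goes straight into the file name, which works for "Chapter 7" but could break on headings containing slashes, colons or other characters that file systems reject. A minimal sketch of a sanitizing step, assuming a plain ASCII naming scheme is acceptable; `ChapterFileNames` and `toFileName` are hypothetical names, not part of jsoup or Commons IO:

```java
public class ChapterFileNames {

    /** Turns a heading such as "Chapter 7" into a conservative file name such as "Chapter_7.html". */
    public static String toFileName(String heading) {
        // Replace every run of characters other than letters, digits, '_' and '-' with one underscore.
        String base = heading.trim().replaceAll("[^A-Za-z0-9_-]+", "_");
        // Drop underscores left over at either end, e.g. from surrounding punctuation.
        base = base.replaceAll("^_+|_+$", "");
        // Fall back to a fixed name if nothing usable remains.
        return (base.isEmpty() ? "chapter" : base) + ".html";
    }
}
```

With this, the File could be built as new File("src/resources/", ChapterFileNames.toFileName(h2Text)). Two different headings can still map to the same name, so a real run might also want to append the h2 id or a counter.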
Thanks for your help, uniknow, and thanks to realskeptic for the criticism.