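So, I'm currently struggling to implement a diff tool that can compare two HTML files. I did some research and ended up using DaisyDiff. Since the tool seems a bit dated by now, I had a hard time finding examples that are still useful. I found this question on Stack Overflow, which helped because I couldn't figure out what to pass as the 3rd and 4th arguments. This is the current state of my implementation: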
String html1 = "<html class='foobar'>Hello</html>";
String html2 = "<html>Bye</html>";
try {
    StringWriter finalResult = new StringWriter();
    SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler result = tf.newTransformerHandler();
    result.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    result.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    result.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    result.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    result.setResult(new StreamResult(finalResult));
    ContentHandler postProcess = result;
    DaisyDiff.diffHTML(
            new InputSource(new StringReader(html1)),
            new InputSource(new StringReader(html2)),
            postProcess, null, Locale.GERMAN);
    System.out.println(finalResult.toString());
} catch (SAXException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
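The problem is that it only diffs the plain text and strips the markup from the input entirely. For example, if I pass these two strings as input:
String first = "<div>Hello</div>";
String second = "<div>Bye</div>";
I would expect this output: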
<div><span class="removed">Hello</span><span class="added">Bye</span></div>
But I only get this:
<span class="removed">Hello</span><span class="added">Bye</span>
Answer 0 (score: 1)
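So, I finally got it working. After finding this example code on GitHub, it became clear that the problem was not the ContentHandler, as I had first suspected. So if anyone else needs to diff some HTML and doesn't want to waste days looking for a good (and working) example, here is how I did it.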
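If you want to check for yourself that the ContentHandler is not what drops the markup, you can feed it a few SAX events by hand and look at what it writes. The following is only a minimal sketch using JDK classes: the class name SaxSerializerSketch and the hand-written event sequence are made up for illustration and are not DaisyDiff output. The handler is set up the same way as in the question, minus the indentation setting.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper class, not part of DaisyDiff.
public class SaxSerializerSketch {

    // Sends a hand-written SAX event stream into a TransformerHandler and
    // returns whatever the handler serializes.
    public static String serialize() throws TransformerConfigurationException, SAXException {
        StringWriter out = new StringWriter();
        SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.setResult(new StreamResult(out));

        AttributesImpl none = new AttributesImpl();
        AttributesImpl removed = new AttributesImpl();
        removed.addAttribute("", "class", "class", "CDATA", "removed");

        // Simulated events: a wrapping div around a span marked as removed.
        char[] text = "Hello".toCharArray();
        handler.startDocument();
        handler.startElement("", "div", "div", none);
        handler.startElement("", "span", "span", removed);
        handler.characters(text, 0, text.length);
        handler.endElement("", "span", "span");
        handler.endElement("", "div", "div");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize());
    }
}
```

The handler writes out every element it receives, including the wrapping div. So when the wrapper is missing from the diff result, the events were never sent to the handler in the first place, which points at the parsing and diffing steps rather than at the serializer.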
First, you need to add the NekoHTML dependency, which is basically an HTML parser.
This is what my import block looks like:
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.outerj.daisy.diff.helper.NekoHtmlParser;
import org.outerj.daisy.diff.html.HTMLDiffer;
import org.outerj.daisy.diff.html.HtmlSaxDiffOutput;
import org.outerj.daisy.diff.html.TextNodeComparator;
import org.outerj.daisy.diff.html.dom.DomTreeBuilder;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
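And this is my complete implementation of the diff, which does not strip the actual markup (note that this is not my code, I just took it from the example linked above!):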
public static String diffHtml(String first, String second) throws TransformerConfigurationException, IOException, SAXException {
    // Serializer that turns the SAX events of the diff into an HTML string.
    StringWriter finalResult = new StringWriter();
    SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler result = tf.newTransformerHandler();
    result.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    result.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    result.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    result.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    result.setResult(new StreamResult(finalResult));
    ContentHandler postProcess = result;

    Locale locale = Locale.getDefault();
    String prefix = "diff";

    // Parse both inputs with NekoHTML and build a tree for each of them.
    NekoHtmlParser cleaner = new NekoHtmlParser();
    InputSource oldSource = new InputSource(new StringReader(first));
    InputSource newSource = new InputSource(new StringReader(second));

    DomTreeBuilder oldHandler = new DomTreeBuilder();
    cleaner.parse(oldSource, oldHandler);
    TextNodeComparator leftComparator = new TextNodeComparator(oldHandler, locale);

    DomTreeBuilder newHandler = new DomTreeBuilder();
    cleaner.parse(newSource, newHandler);
    TextNodeComparator rightComparator = new TextNodeComparator(newHandler, locale);

    // Run the diff; the result is written to postProcess as SAX events.
    HtmlSaxDiffOutput output = new HtmlSaxDiffOutput(postProcess, prefix);
    HTMLDiffer differ = new HTMLDiffer(output);
    differ.diff(leftComparator, rightComparator);

    System.out.println(finalResult.toString());
    return finalResult.toString();
}
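Oh, and if you get an error involving the IProgressMonitor interface, note that it has been moved from org.eclipse.core.runtime to org.eclipse.equinox.common, so remember to use the right dependency. I stumbled over that little issue as well. I hope this helps!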