Question

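I'm using JSoup to parse an HTML file and remove elements that are invalid in XML, because I need to apply XSLT to the file. The problem I'm running into is that "&nbsp;" entities exist in my document. I need to change them to the numeric character reference "&#160;" so that I can run XSLT on the file.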

So I want:

<p> &nbsp; </p> 
<p> &nbsp; </p> 
<p> &nbsp; </p> 
<p> &nbsp; </p>

to become:

<p> &#160; </p> 
<p> &#160; </p> 
<p> &#160; </p> 
<p> &#160; </p>

I tried a text replacement, but it doesn't work:

Elements els = doc.body().getAllElements();
for (Element e : els) {
    List<TextNode> tnList = e.textNodes();
    for (TextNode tn : tnList){
        String orig = tn.text();
        tn.text(orig.replaceAll("&nbsp;","&#160;")); 
    }
}

The code that performs the parsing:

File f = new File ("C:/Users/jrothst/Desktop/Test File.htm");

Document doc = Jsoup.parse(f, "UTF-8");
doc.outputSettings().syntax( Document.OutputSettings.Syntax.xml );  
System.out.println("Starting parse..");
performConversion(doc);

String html = doc.toString();
System.out.println(html);
FileUtils.writeStringToFile(f, doc.outerHtml(), "UTF-8");

How can I make this change using the JSoup library?

Answer 1

The following works for me. You don't need to do any manual search and replace:

File f = new File ("C:/Users/seanbright/Desktop/Test File.htm");

Document doc = Jsoup.parse(f, "UTF-8");
doc.outputSettings()
    .syntax(Document.OutputSettings.Syntax.xml)
    .escapeMode(Entities.EscapeMode.xhtml);

System.out.println(doc.toString());

Input:

<html><head></head><body>&nbsp;</body></html>

Output:

<html><head></head><body>&#xa0;</body></html>

(&#xa0; is the same as &#160;, just in hexadecimal instead of decimal.)
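
If a downstream tool specifically expects the decimal form, the serialized string could be post-processed after JSoup has written it out. The following is only a minimal sketch using the Java standard library; the class name NumericReferenceDemo and the helper toDecimalNbsp are hypothetical names chosen for illustration and are not part of JSoup. An XML parser should treat both spellings as the same character, so this step is likely unnecessary for XSLT.

```java
public class NumericReferenceDemo {

    // Hypothetical helper (not a JSoup API): rewrites the hexadecimal
    // character reference for U+00A0 into its decimal spelling.
    // (?i) makes the match case-insensitive, so &#xA0; is handled too.
    public static String toDecimalNbsp(String xml) {
        return xml.replaceAll("(?i)&#xa0;", "&#160;");
    }

    public static void main(String[] args) {
        // Hex a0 and decimal 160 name the same code point.
        System.out.println(Integer.parseInt("a0", 16) == 160);

        // Stand-in for the string returned by doc.toString() in the answer.
        String serialized = "<html><head></head><body>&#xa0;</body></html>";
        System.out.println(toDecimalNbsp(serialized));
    }
}
```

Note that this works on the serialized output, not on TextNode.text(), which returns already-decoded text and therefore never contains the literal "&nbsp;" string.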
