Question

I am using this code to download an XML file.

String url = "https://www.sec.gov/Archives/edgar/data/16160/000001616016000061/calm-20160528.xml";

String fileName = url.substring(url.lastIndexOf("/") + 1);

String completeFileLocationWithName = "/home/user/Downloads/XBRLCODE/" + fileName;

URL surl = new URL(url);
URLConnection con = surl.openConnection();
con.setConnectTimeout(0);
con.setReadTimeout(0);
InputStream in = con.getInputStream();
Files.copy(in, Paths.get(completeFileLocationWithName));

I also tried using String escapedInput = StringEscapeUtils.escapeXml(appNameInput);

The INPUT is: the URL

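The OUTPUT should be: when the XML is downloaded, it should not contain escaped characters such as &lt;, &gt;, &amp; and so on. The literal characters <, >, & are fine for me.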

Could someone please share their knowledge on this?

Answer 1

Use StringEscapeUtils from the commons-lang.jar library.

Here is working code:

import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringEscapeUtils;

public class Test {

    public static void main(String[] args) {
        String url = "https://www.sec.gov/Archives/edgar/data/16160/000001616016000061/calm-20160528.xml";

        try {
            URL surl = new URL(url);
            URLConnection con = surl.openConnection();
            // A timeout of 0 means wait indefinitely.
            con.setConnectTimeout(0);
            con.setReadTimeout(0);
            // try-with-resources closes the stream even if copying fails.
            try (InputStream in = con.getInputStream()) {
                StringWriter writer = new StringWriter();
                IOUtils.copy(in, writer, "UTF-8");
                System.out.println(StringEscapeUtils.unescapeHtml(writer.toString()));
            }
        } catch (MalformedURLException ex) {
            Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
        }

    }
}

The markup in the output is no longer escaped. Here is a sample from the console:

<td valign="bottom" style="width:02.96%;border-top:1pt none #D9D9D9 ;border-left:1pt none #D9D9D9 ;border-bottom:1pt none #D9D9D9 ;border-right:1pt none #D9D9D9 ;background-color: #auto;height:1.00pt;padding:0pt;">
                    <p style="margin:0pt;font-family:Times New Roman;height:1.00pt;overflow:hidden;font-size:0pt;">
                        &nbsp;</p>
                </td>
                <td valign="bottom" style="width:02.40%;border-top:1pt none #D9D9D9 ;border-left:1pt none #D9D9D9 ;border-bottom:1pt none #D9D9D9 ;border-right:1pt none #D9D9D9 ;background-color: #auto;height:1.00pt;padding:0pt;">
                    <p style="margin:0pt;font-family:Times New Roman;height:1.00pt;overflow:hidden;font-size:0pt;">
                        &nbsp;</p>
                </td>
                <td valign="bottom" style="width:11.82%;border-top:1pt none #D9D9D9 ;border-left:1pt none #D9D9D9 ;border-bottom:1pt none #D9D9D9 ;border-right:1pt none #D9D9D9 ;background-color: #auto;height:1.00pt;padding:0pt;">
                    <p style="margin:0pt;font-family:Times New Roman;height:1.00pt;overflow:hidden;font-size:0pt;">
                        &nbsp;</p>
                </td>

Keep in mind that you need:

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringEscapeUtils;

Answer 2

I think you have slightly misunderstood the problem. The XML here contains embedded HTML (which itself has embedded CSS, as it happens).

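To be included in that node, these characters have to be escaped, otherwise the whole XML document would be invalid (<, >, & and so on are all reserved characters in XML).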
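To illustrate why the escaping is required, here is a minimal sketch using only the JDK's built-in DOM parser. The class name WellFormedCheck, the helper isWellFormed, and the sample strings are made up for this example. It checks whether a piece of text parses as well-formed XML, once with the embedded markup escaped and once with a raw ampersand.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any answer above.
class WellFormedCheck {

    // Returns true if the given text parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler so parse errors are not printed to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Embedded HTML with its markup characters escaped: acceptable as XML text content.
        String escaped = "<note>&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;</note>";
        // A raw ampersand in text content: rejected by the parser.
        String raw = "<note>Tom & Jerry</note>";
        System.out.println(isWellFormed(escaped));
        System.out.println(isWellFormed(raw));
    }
}
```

The same reasoning applies to the SEC file: unescaping the whole document at once may leave you with something an XML parser will no longer accept.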

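If you mean that you want the content of the XML node (us-gaap:FiscalPeriod) to be unescaped, then you should extract its string value and then use something like the already suggested StringEscapeUtils.unescapeHtml.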
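Extracting the string value of a node could look roughly like the sketch below, which uses the JDK's DOM parser. The class NodeTextExtractor, the method firstElementText, the sample document, and the namespace URI http://example.com/us-gaap are placeholders invented for this example, not taken from the real filing. Note that the parser already decodes one level of entities when you read the text content. If the text still contains references such as &nbsp; afterwards, that is where a call like StringEscapeUtils.unescapeHtml would come in.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any answer above.
class NodeTextExtractor {

    // Returns the text of the first element with the given qualified name, or null if absent.
    static String firstElementText(String xml, String qualifiedName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName(qualifiedName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample; the namespace URI is a placeholder.
        String xml = "<xbrl xmlns:us-gaap=\"http://example.com/us-gaap\">"
                + "<us-gaap:FiscalPeriod>&lt;p&gt;Fiscal year&lt;/p&gt;</us-gaap:FiscalPeriod>"
                + "</xbrl>";
        // The parser turns &lt; and &gt; back into < and > in the text content.
        System.out.println(firstElementText(xml, "us-gaap:FiscalPeriod"));
    }
}
```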

Depending on what you are trying to do, you may then want to go on and strip all HTML tags from the output.
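For the tag-stripping step, a very rough sketch with a regular expression might look like this. The class TagStripper and the method stripTags are made-up names. This approach is naive: it will misbehave on input where a > appears inside an attribute value, and on comments or script blocks, so a real HTML parser is likely the safer choice for anything beyond simple fragments like the ones in this file.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any answer above.
class TagStripper {

    // Matches anything that looks like a tag: '<', then any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes tag-like substrings and trims surrounding whitespace.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String html = "<p style=\"margin:0pt;\">Fiscal year 2016</p>";
        System.out.println(stripTags(html));
    }
}
```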

Answer 3

The following seems to work.

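    // "xxxxx" and "yyyy" are placeholder input and output paths.
    try (InputStream iStream = new FileInputStream(new File("xxxxx"));
         OutputStream oStream = new FileOutputStream("yyyy")) {
        StringWriter writer = new StringWriter();
        IOUtils.copy(iStream, writer, "UTF-8");
        String theString = writer.toString();
        // Write with the same charset that was used for reading.
        IOUtils.write(StringEscapeUtils.unescapeXml(theString), oStream, "UTF-8");
    }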
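To make it clearer what the unescaping step does, here is a small JDK-only sketch that handles just the five predefined XML entities. The class XmlEntities and the method unescapePredefined are invented for this illustration. It is not how StringEscapeUtils is implemented, and it does not cover numeric character references.

```java
// Hypothetical helper class, not part of any answer above.
class XmlEntities {

    // Replaces the five predefined XML entities with their literal characters.
    static String unescapePredefined(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                // &amp; is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(unescapePredefined("&lt;td&gt;A &amp; B&lt;/td&gt;"));
    }
}
```

The ordering matters: if &amp; were replaced first, text that was escaped twice would be unescaped twice in a single pass.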

How to download an XML file from a URL by escaping special characters such as &lt; &gt; &amp; etc.?
