Control code 0x6 causes XML error

Time: 2011-11-25 08:43:59

Tags: java xml unicode saxparser

I run a Java application that receives its data as XML, but occasionally some of the data contains some kind of control code:

An invalid XML character (Unicode: 0x6) was found in the CDATA section.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x6) was found in the CDATA section.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at domain.Main.processLogFromUrl(Main.java:342)
    at domain.Main.<init>(Main.java:67)
    at domain.Main.main(Main.java:577)

Can anyone explain exactly what this control code does? I can't find much information about it.

Thanks in advance.

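3 Answers:

Answer 0: (score: 1)

You need to write a FilterInputStream that filters the data before the SAX parser gets it. It must either remove or re-encode the bad data.

Apache has a super-flexible example. You may want to put together something simpler.

Here is one of mine. It does other kinds of cleanup, but I think it makes a good starting point.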

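import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Logger;

/* Cleans up often very bad xml. 
 * 
 * 1. Strips leading white space.
 * 2. Recodes &pound; etc to &#...;.
 * 3. Recodes lone & as &amp;.
 * 
 * Note: XML.hash(...) and Separator are helper classes of the answer's
 * author that are not shown in this snippet.
 */
public class XMLInputStream extends FilterInputStream {

  // Assumed to be a java.util.logging logger; the original snippet did not show its declaration.
  private static final Logger log = Logger.getLogger(XMLInputStream.class.getName());
  private static final int MIN_LENGTH = 2;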
  // Everything we've read.
  StringBuilder red = new StringBuilder();
  // Data I have pushed back.
  StringBuilder pushBack = new StringBuilder();
  // How much we've given them.
  int given = 0;
  // How much we've read.
  int pulled = 0;

  public XMLInputStream(InputStream in) {
    super(in);
  }

  public int length() {
    // NB: This is a Troll length (i.e. it goes 1, 2, many) so 2 actually means "at least 2"

    try {
      StringBuilder s = read(MIN_LENGTH);
      pushBack.append(s);
      return s.length();
    } catch (IOException ex) {
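      // Assuming a java.util.logging logger: warning() takes no Throwable, so use log(Level, String, Throwable).
      log.log(java.util.logging.Level.WARNING, "Oops ", ex);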
    }
    return 0;
  }

  private StringBuilder read(int n) throws IOException {
    // Input stream finished?
    boolean eof = false;
    // Read that many.
    StringBuilder s = new StringBuilder(n);
    while (s.length() < n && !eof) {
      // Always get from the pushBack buffer.
      if (pushBack.length() == 0) {
        // Read something from the stream into pushBack.
        eof = readIntoPushBack();
      }

      // Pushback only contains deliverable codes.
      if (pushBack.length() > 0) {
        // Grab one character
        s.append(pushBack.charAt(0));
        // Remove it from pushBack
        pushBack.deleteCharAt(0);
      }

    }
    return s;
  }

  // Returns true at eof.
  // Might not actually push back anything but usually will.
  private boolean readIntoPushBack() throws IOException {
    // File finished?
    boolean eof = false;
    // Next char.
    int ch = in.read();
    if (ch >= 0) {
      // Discard whitespace at start?
      if (!(pulled == 0 && isWhiteSpace(ch))) {
        // Good code.
        pulled += 1;
        // Parse out the &stuff;
        if (ch == '&') {
          // Process the &
          readAmpersand();
        } else {
          // Not an '&', just append.
          pushBack.append((char) ch);
        }
      }
    } else {
      // Hit end of file.
      eof = true;
    }
    return eof;
  }

  // Deal with an ampersand in the stream.
  private void readAmpersand() throws IOException {
    // Read the whole word, up to and including the ;
    StringBuilder reference = new StringBuilder();
    int ch;
    // Should end in a ';'
    for (ch = in.read(); isAlphaNumeric(ch); ch = in.read()) {
      reference.append((char) ch);
    }
    // Did we tidily finish?
    if (ch == ';') {
      // Yes! Translate it into a &#nnn; code.
      String code = XML.hash(reference);
      if (code != null) {
        // Keep it.
        pushBack.append(code);
      } else {
        throw new IOException("Invalid/Unknown reference '&" + reference + ";'");
      }
    } else {
      // Did not terminate properly! 
      // Perhaps an & on its own or a malformed reference.
      // Either way, escape the &
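      pushBack.append("&amp;").append(reference);
      // Only pass on the terminating character if we did not hit end of stream (-1).
      if (ch >= 0) {
        pushBack.append((char) ch);
      }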
    }
  }

  private void given(CharSequence s, int wanted, int got) {
    // Keep track of what we've given them.
    red.append(s);
    given += got;
    log.finer("Given: [" + wanted + "," + got + "]-" + s);
  }

  @Override
  public int read() throws IOException {
    StringBuilder s = read(1);
    // Count what was actually delivered (0 at end of stream).
    given(s, 1, s.length());
    return s.length() > 0 ? s.charAt(0) : -1;
  }

  @Override
  public int read(byte[] data, int offset, int length) throws IOException {
    int n = 0;
    StringBuilder s = read(length);
    for (int i = 0; i < Math.min(length, s.length()); i++) {
      data[offset + i] = (byte) s.charAt(i);
      n += 1;
    }
    given(s, length, n);
    return n > 0 ? n : -1;
  }

  @Override
  public String toString() {
    String s = red.toString();
    String h = "";
    // Hex dump the small ones.
    if (s.length() < 8) {
      Separator sep = new Separator(" ");
      for (int i = 0; i < s.length(); i++) {
        h += sep.sep() + Integer.toHexString(s.charAt(i));
      }
    }
    return "[" + given + "]-\"" + s + "\"" + (h.length() > 0 ? " (" + h + ")" : "");
  }

  private boolean isWhiteSpace(int ch) {
    switch (ch) {
      case ' ':
      case '\r':
      case '\n':
      case '\t':
        return true;
    }
    return false;
  }

  private boolean isAlphaNumeric(int ch) {
    return ('a' <= ch && ch <= 'z') 
        || ('A' <= ch && ch <= 'Z') 
        || ('0' <= ch && ch <= '9');
  }
}
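
For the specific problem in the question (a stray 0x6), a much smaller filter may be enough. The following is a minimal sketch, not the class above: `ControlCharFilterInputStream` is a hypothetical name, and it assumes the stream uses an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, where a byte below 0x20 can only be a control character. It would corrupt UTF-16 input. It simply drops the offending bytes; re-encoding them is a separate policy decision.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: drops bytes below 0x20 other than tab, LF and CR. */
final class ControlCharFilterInputStream extends FilterInputStream {

  ControlCharFilterInputStream(InputStream in) {
    super(in);
  }

  // True for byte values that XML 1.0 forbids (assumes an ASCII-compatible encoding).
  private static boolean isForbidden(int b) {
    return b >= 0 && b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
  }

  @Override
  public int read() throws IOException {
    int b;
    // Skip forbidden bytes; -1 (end of stream) is passed through.
    do {
      b = in.read();
    } while (isForbidden(b));
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    while (true) {
      int n = in.read(buf, off, len);
      if (n <= 0) {
        return n;
      }
      // Compact the chunk in place, keeping only the allowed bytes.
      int kept = 0;
      for (int i = 0; i < n; i++) {
        byte b = buf[off + i];
        if (!isForbidden(b & 0xFF)) {
          buf[off + kept++] = b;
        }
      }
      if (kept > 0) {
        return kept;
      }
      // The whole chunk was filtered out; read again rather than return 0.
    }
  }
}
```

To use it, wrap the raw stream before handing it to the parser, for example `builder.parse(new ControlCharFilterInputStream(rawStream))` with a `javax.xml.parsers.DocumentBuilder`.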

Answer 1: (score: 0)

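Why you are getting this character depends on what the data is meant to represent. (U+0006 is apparently ACK, the ASCII "acknowledge" control character, which is an odd thing to find in a file...) The important point, however, is that it makes the XML invalid: you simply cannot represent that character in XML.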

From the XML 1.0 spec, section 2.2:

Character Range

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char       ::=   #x9 | #xA | #xD | [#x20-#xD7FF] 
                     | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

Note that this excludes every Unicode value below U+0020 except U+0009 (tab), U+000A (line feed), and U+000D (carriage return).
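
The production above translates directly into a predicate on code points. This is only a sketch; `XmlChars`, `isXml10Char` and `strip` are hypothetical names, not part of any library. It works on code points rather than `char` values so that supplementary characters (U+10000 and above) are kept, while unpaired surrogates are dropped.

```java
/** Hypothetical helper mirroring the XML 1.0 Char production. */
final class XmlChars {

  private XmlChars() {}

  // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  static boolean isXml10Char(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  // Returns a copy of s without the code points that XML 1.0 forbids.
  static String strip(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      if (isXml10Char(cp)) {
        out.appendCodePoint(cp);
      }
      i += Character.charCount(cp);
    }
    return out.toString();
  }
}
```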

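If you have any influence over whatever produces the data, you should change it to return valid XML. If not, you will have to preprocess the data before parsing it as XML. What you do with the unwanted control characters depends on what they mean in your situation.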
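
As a rough illustration of such a preprocessing step, the sketch below sanitizes the text and only then parses it. `SanitizingXmlParser` and its methods are hypothetical names, and it assumes the whole document is already available as a `String`. The `replacement` argument stands in for the policy decision: pass `""` to delete the characters, or a visible marker such as `"?"` if their position matters.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: replace characters XML 1.0 forbids, then parse. */
final class SanitizingXmlParser {

  private SanitizingXmlParser() {}

  // Matches any code point outside the XML 1.0 Char production.
  private static final Pattern INVALID = Pattern.compile(
      "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

  static String sanitize(String xml, String replacement) {
    // quoteReplacement keeps '$' and '\' in the replacement literal.
    return INVALID.matcher(xml).replaceAll(Matcher.quoteReplacement(replacement));
  }

  static Document parse(String xml, String replacement) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(sanitize(xml, replacement))));
  }
}
```

Note that the replacement text is inserted as-is, so it should not itself contain markup characters such as `<` or `&`.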

Answer 2: (score: -2)

Try declaring the XML as version 1.1:

<?xml version="1.1"?>