Need to remove illegal characters in XML string

Asked: 2016-12-09 12:50:52

Tags: c# xml

I have to process XML data in C#, but sometimes it contains an illegal XML character. For example, this XML will not parse because it is invalid:

<xml>Another way to write a heart is <3</xml>

The XML parser throws an error because the input is not valid, which makes sense. However, I can't find a way to replace only that one "<" with "&lt;" so that the parser receives:

<xml>Another way to write a heart is &lt;3</xml>

Footnote: this can occur in any node of the XML, which can itself be pretty large, and as I said before, it does not happen every time.

Is there a function that can handle this?

3 answers:

Answer 0 (score: 2)

There is no general solution to this, because you have no way of determining whether:

<xml>You can use <b></b> to highlight stuff in HTML.</xml>.

is a "mistake" and should actually be encoded:

<xml>You can use &lt;b&gt;&lt;/b&gt; to highlight stuff in HTML.</xml>.

or not.

Thus, since there is no general solution, you can only use imperfect heuristics to detect such issues.

There is no built-in heuristic in the .NET BCL, so you will have to roll your own or find an external library. A simple heuristic, for example, would be to find every < that is not followed by [/a-zA-Z0-9]+> and escape it.
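A minimal sketch of that heuristic, assuming the input only contains simple tags: tags with attributes, comments, CDATA sections, and the XML declaration would all be escaped by mistake, and text such as "<3>" would be kept as if it were a tag. The names XmlHeuristics and EscapeStrayLessThan are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class XmlHeuristics {
    // Matches a '<' that is NOT followed by [/a-zA-Z0-9]+>,
    // i.e. a '<' that does not look like the start of a simple tag.
    static readonly Regex StrayLessThan = new Regex(@"<(?![/a-zA-Z0-9]+>)");

    // Hypothetical helper: escapes every '<' the heuristic considers stray.
    public static string EscapeStrayLessThan(string xml) {
        return StrayLessThan.Replace(xml, "&lt;");
    }
}

static class Program {
    static void Main() {
        string broken = "<xml>Another way to write a heart is <3</xml>";
        Console.WriteLine(XmlHeuristics.EscapeStrayLessThan(broken));
        // prints: <xml>Another way to write a heart is &lt;3</xml>
    }
}
```

The negative lookahead leaves anything that already looks like a tag untouched, so the ambiguous <b></b> case above stays as markup rather than being escaped.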

Heuristics are intrinsically imperfect, so if you have the opportunity to fix the system creating those broken looks-like-XML-but-isn't files, this would be a much better solution.

Answer 1 (score: 1)

The following is copied from a previous answer by @IgorKustov.

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also available in Silverlight. Here is a small sample:

using System;
using System.Linq;
using System.Xml;

static class Program {
    static void Main() {
        string content = "\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False

        content = RemoveInvalidXmlChars(content);
        Console.WriteLine(IsValidXmlString(content)); // True
    }

    // Keeps only the characters that XmlConvert.IsXmlChar accepts.
    static string RemoveInvalidXmlChars(string text) {
        var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
        return new string(validXmlChars);
    }

    // VerifyXmlChars throws if the text contains a character that is not allowed in XML.
    static bool IsValidXmlString(string text) {
        try {
            XmlConvert.VerifyXmlChars(text);
            return true;
        } catch {
            return false;
        }
    }
}

To escape invalid XML characters instead, I suggest using the XmlConvert.EncodeName method. Here is a small sample:

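using System;
using System.Xml;

static class Program {
    static void Main() {
        const string content = "\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False

        // EncodeName replaces characters that are invalid in XML names with escape sequences.
        string encoded = XmlConvert.EncodeName(content);
        Console.WriteLine(IsValidXmlString(encoded)); // True

        // DecodeName reverses the encoding.
        string decoded = XmlConvert.DecodeName(encoded);
        Console.WriteLine(content == decoded); // True
    }

    // VerifyXmlChars throws if the text contains a character that is not allowed in XML.
    static bool IsValidXmlString(string text) {
        try {
            XmlConvert.VerifyXmlChars(text);
            return true;
        } catch {
            return false;
        }
    }
}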

Update: Note that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This matters when you store the encoded string in a database column with a length limit but validate only the length of the source string in your app.
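As a rough illustration of that length concern, the check can be made against the encoded string rather than the source string. This is only a sketch: the helper name FitsColumn and the column limit of 10 are made up for the example.

```csharp
using System;
using System.Xml;

static class EncodedLengthCheck {
    // Hypothetical helper: validates the ENCODED length against the column limit,
    // because XmlConvert.EncodeName can return a string longer than its input.
    public static bool FitsColumn(string source, int maxColumnLength) {
        return XmlConvert.EncodeName(source).Length <= maxColumnLength;
    }
}

static class Program {
    static void Main() {
        const string content = "\v\f\0";
        const int maxColumnLength = 10; // assumed column limit, for illustration only

        Console.WriteLine(content.Length <= maxColumnLength);                      // source length check
        Console.WriteLine(EncodedLengthCheck.FitsColumn(content, maxColumnLength)); // encoded length check
    }
}
```

Validating only content.Length would let this value through even though the encoded form may no longer fit the column.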

Answer 2 (score: 0)

You could use a regex to repair the XML string. This is the code from the linked post (it is Java, and StringEscapeUtils comes from Apache Commons Lang):

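import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringEscapeUtils; // org.apache.commons.lang in Commons Lang 2.x

public static String repair(String xml) {
    // Capture the opening tag, the raw text content, and the closing tag separately.
    Pattern pattern = Pattern.compile("(<attribute name=\"[^\"]+\">)(.*?)(</attribute>)");
    Matcher m = pattern.matcher(xml);
    StringBuffer buf = new StringBuffer(xml.length() + xml.length() / 32);
    while (m.find()) {
        String escaped = StringEscapeUtils.escapeXml(m.group(2));
        // quoteReplacement keeps '$' and '\' in the content from being read as group references.
        m.appendReplacement(buf, Matcher.quoteReplacement(m.group(1) + escaped + m.group(3)));
    }
    m.appendTail(buf);
    return buf.toString();
}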
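Since the question is about C#, the same idea might be sketched there roughly as follows. This is an assumption-laden translation, not code from the linked post: the class and method names are made up, SecurityElement.Escape stands in for StringEscapeUtils.escapeXml, and RegexOptions.Singleline is added so that content spanning several lines is matched as well. Like the Java version, it only handles elements of the form <attribute name="...">, and content that is already escaped would be escaped a second time.

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;

static class XmlRepair {
    // Same pattern as the Java code: opening tag, raw content, closing tag.
    static readonly Regex AttributeElement = new Regex(
        "(<attribute name=\"[^\"]+\">)(.*?)(</attribute>)",
        RegexOptions.Singleline);

    // Hypothetical helper: escapes the text content of each matched element.
    public static string Repair(string xml) {
        return AttributeElement.Replace(xml, m =>
            m.Groups[1].Value
            + SecurityElement.Escape(m.Groups[2].Value) // escapes <, >, &, " and '
            + m.Groups[3].Value);
    }
}

static class Program {
    static void Main() {
        string broken = "<attribute name=\"x\">1 < 2 & 3</attribute>";
        Console.WriteLine(XmlRepair.Repair(broken));
        // prints: <attribute name="x">1 &lt; 2 &amp; 3</attribute>
    }
}
```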

Depending on the size of your XML string, performance could be an issue. But at least to my knowledge, there is no parser that can read XML with illegal characters and remove them.