Question

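I have some XML files in which control sequences are embedded in the text: EOT, comma, ETX (another char). The extra character after EOT comma ETX is not always present and is not always the same. Actual example: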

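<FatturaElettronicaHeader xmlns=""> </F<EOT>‚<ETX>èatturaElettronicaHeader>

Here <EOT> is the 04 character and <ETX> is 03. Since I have to parse the XML, this is a big problem. Is this some kind of encoding I have never heard of?

I tried removing all control characters from my string, but that still leaves the unwanted comma. If I use Encoding.ASCII.GetString(file); the unwanted characters are replaced with a '?' that is easy to remove, but some unwanted characters still remain and cause parse errors, something like this:

<BIC></WBIC>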

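So I need to remove every control character sequence of this kind to be able to parse these files, and I am unsure how to check programmatically whether a character is part of a control sequence.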
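As a starting point for the "how do I check" part, here is a minimal sketch that lists where control characters occur in a string. It relies on `char.IsControl` from .NET; the class and method names (`ControlCharScan`, `FindControlChars`) are made up for illustration, and treating tab, CR and LF as harmless is an assumption about what the XML legitimately contains.

```csharp
using System;
using System.Collections.Generic;

static class ControlCharScan
{
    // Returns the index and code point of every control character,
    // skipping tab, carriage return and line feed (assumed to be legitimate).
    public static List<(int Index, int Code)> FindControlChars(string input)
    {
        var hits = new List<(int Index, int Code)>();
        for (int i = 0; i < input.Length; i++)
        {
            char ch = input[i];
            if (char.IsControl(ch) && ch != '\t' && ch != '\r' && ch != '\n')
            {
                hits.Add((i, (int)ch));
            }
        }
        return hits;
    }
}
```

Printing the characters around each reported index shows which patterns actually occur in a given file before deciding what to strip.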

Answer 1

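I found two bad patterns in my files: the first is the one in the title, and the second is EOT followed by <. To get it working I looked at the following thread: Remove substring that starts with SOT and ends EOT, from string

and modified the code a little: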

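// requires: using System;
private static string RemoveInvalidCharacters(string input)
{
    while (true)
    {
        // Locate the next EOT (U+0004); stop when none is left.
        var start = input.IndexOf('\u0004');
        if (start == -1) break;

        // Pattern 2: EOT immediately followed by '<'. Remove both characters.
        if (start + 1 < input.Length && input[start + 1] == '<')
        {
            input = input.Remove(start, 2);
            continue;
        }

        // Pattern 1: EOT, comma, ETX (U+0003), plus one trailing character if present.
        if (start + 2 < input.Length && input[start + 2] == '\u0003')
        {
            input = input.Remove(start, Math.Min(4, input.Length - start));
            continue;
        }

        // Unrecognized sequence: drop the lone EOT so the loop cannot spin forever.
        input = input.Remove(start, 1);
    }
    return input;
}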
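For comparison, the same two patterns can probably be handled in a single pass with a regular expression. This is only a sketch, not the code I actually ran: the names `ControlSequenceCleaner` and `Clean` are invented here, and unlike the loop above it does not rescan from the start after each removal, so results may differ if sequences are nested inside each other.

```csharp
using System;
using System.Text.RegularExpressions;

static class ControlSequenceCleaner
{
    // Alternative 1: EOT followed by '<' (both characters are removed).
    // Alternative 2: EOT, any one character, ETX, and one optional trailing character.
    // Singleline lets '.' match newlines as well.
    private static readonly Regex Pattern =
        new Regex("\u0004<|\u0004.\u0003.?", RegexOptions.Singleline);

    public static string Clean(string input)
    {
        return Pattern.Replace(input, string.Empty);
    }
}
```

Usage on the fragment from the question:

```csharp
string fixedTag = ControlSequenceCleaner.Clean("</F\u0004\u201A\u0003\u00E8atturaElettronicaHeader>");
// fixedTag is "</FatturaElettronicaHeader>"
```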

Then I cleaned up further with this code:

// requires: using System; using System.Text;
static string StripExtended(string arg)
{
    StringBuilder buffer = new StringBuilder(arg.Length); // capacity = maximum possible length
    foreach (char ch in arg)
    {
        UInt16 num = Convert.ToUInt16(ch); // in .NET, chars are UTF-16 code units
        // Printable ASCII occupies 32..126; everything else is dropped,
        // including control characters, line breaks and non-ASCII letters such as 'è'.
        if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
    }
    return buffer.ToString();
}

Now everything parses fine.
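One caveat with the printable-ASCII filter is that it also throws away line breaks and accented letters. If that matters, a gentler variant might keep every character that XML 1.0 allows, using `XmlConvert.IsXmlChar` from `System.Xml`. This is a sketch under that assumption; `XmlCharFilter` and `StripNonXmlChars` are names made up for the example, and it only drops the illegal characters themselves, so the stray comma and trailing character still have to be removed by the pattern-based step first.

```csharp
using System;
using System.Text;
using System.Xml;

static class XmlCharFilter
{
    // Keeps tab, CR, LF and all other characters XML 1.0 permits.
    // Surrogate halves are tested one at a time here, so characters outside
    // the Basic Multilingual Plane are dropped too.
    public static string StripNonXmlChars(string input)
    {
        var buffer = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            if (XmlConvert.IsXmlChar(ch)) buffer.Append(ch);
        }
        return buffer.ToString();
    }
}
```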

Answer 2

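Sorry for the late reply, but I think the root of the problem may be that the p7m file was decoded incorrectly. I believe the XML files being cleaned were originally .xml.p7m files. I believe the correct way to clean them is to use a library such as BouncyCastle (available for Java and .NET) and its CmsSignedData class.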

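// requires: using System.IO; using Org.BouncyCastle.Cms;
// content is a byte[] holding the raw .p7m file.
CmsSignedData cmsObj = new CmsSignedData(content);
if (cmsObj.SignedContent != null)
{
    using (var stream = new MemoryStream())
    {
        // Write the signed payload (the original XML) without the CMS envelope.
        cmsObj.SignedContent.Write(stream);
        content = stream.ToArray();
    }
}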

Remove control character sequence from string: EOT comma ETX
