Question

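I have some files that contain special characters such as é, ã, Δ, and Ù. I want to replace them with 4-digit hexadecimal NCRs (numeric character references). I have tried the following approach, but I'm not sure it is the fastest way to achieve this...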

var entities = new[]
{
new { ser = "\u00E9", rep = @"&#x00E9;" },
new { ser = "\u00E3", rep = @"&#x00E3;" },
new { ser = "\u00EA", rep = @"&#x00EA;" },
new { ser = "\u00E1", rep = @"&#x00E1;" },
new { ser = "\u00C1", rep = @"&#x00C1;" },
new { ser = "\u00C9", rep = @"&#x00C9;" },
new { ser = "\u0394", rep = @"&#x0394;" },
new { ser = "\u03B1", rep = @"&#x03B1;" },
new { ser = "\u03B2", rep = @"&#x03B2;" },
new { ser = "\u00B1", rep = @"&#x00B1;" },
//... so on
};

var files = Directory.GetFiles(path, "*.xml");
foreach (var file in files)
{
    string txt = File.ReadAllText(file);

    foreach (var entity in entities)
    {
        if (Regex.IsMatch(txt, entity.ser))
        {
            txt = Regex.Replace(txt, entity.ser, entity.rep);
        }
    };
    File.WriteAllText(file, txt);
}

Is there a faster and more efficient way to do this?

Answer 1

Based on the comments, you want to replace Unicode characters (for example, Ù) with their Unicode value (&#x00D9;). Regex.Replace is probably the best way to achieve this.

Here is the loop that processes the files:

var files = Directory.GetFiles(path, "*.xml");
foreach (var file in files)
{
    string txt = File.ReadAllText(file);

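    // Match one non-ASCII character at a time, so that HandleMatch
    // receives exactly one character per call.
    string newTxt = Regex.Replace(
        txt,
        @"[^\u0000-\u007F]",
        HandleMatch);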

    File.WriteAllText(file, newTxt);
}

Here is the match evaluator:

// Characters to replace. The Contains() call below needs "using System.Linq;".
private static char[] replacements = new[]
{
    'ø',
    'Ù'
};

private static string HandleMatch(Match m)
{
    // The pattern for the Regex will only match a single character, so get that character
    char c = m.Value[0];

    // Check if this is one of the characters we want to replace
    if (!replacements.Contains(c))
    {
        return m.Value;
    }

    // Convert the character to the 4 hex digit code
    string code = ((int) c).ToString("X4");

    // Format and return the code, including the terminating semicolon
    return "&#x" + code + ";";
}

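In the loop, you only need to read each file once, and the Regex.Replace method then handles replacing every occurrence in the input. The regex pattern matches anything outside the range 0x00 - 0x7F, that is, anything that is not one of the first 128 characters (the ASCII characters).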
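
To make the pattern's behavior concrete, here is a minimal sketch that converts every non-ASCII character in a string, with no filtering. The class and method names (NonAsciiDemo, EncodeAll) are made up for illustration. It assumes all characters are in the Basic Multilingual Plane; characters outside it are stored as surrogate pairs and would be emitted as two separate codes, which is not a valid NCR.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonAsciiDemo
{
    // Matches exactly one UTF-16 code unit outside the ASCII range 0x00-0x7F.
    private static readonly Regex NonAscii =
        new Regex(@"[^\u0000-\u007F]", RegexOptions.Compiled);

    public static string EncodeAll(string input)
    {
        // The lambda acts as the MatchEvaluator: it is called once per match
        // and returns the 4-digit hex NCR for that single character.
        return NonAscii.Replace(
            input,
            m => "&#x" + ((int)m.Value[0]).ToString("X4") + ";");
    }
}
```

For example, NonAsciiDemo.EncodeAll("caf\u00E9 \u0394x") returns "caf&#x00E9; &#x0394;x", while a pure ASCII string comes back unchanged.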

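If you only need to replace specific Unicode characters, you need to build a list of those characters and check the value of "c" against that list in the HandleMatch() function.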
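
As a rough sketch of that check, the list can be held in a HashSet<char> so that each lookup stays cheap even when the list grows. The names SelectiveEncoder, Targets, and Encode are hypothetical, and the two characters in the set are only placeholders for your real list.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SelectiveEncoder
{
    // Placeholder list: only these characters are converted (ø and Ù).
    private static readonly HashSet<char> Targets =
        new HashSet<char> { '\u00F8', '\u00D9' };

    public static string Encode(string input)
    {
        return Regex.Replace(input, @"[^\u0000-\u007F]", m =>
        {
            char c = m.Value[0];

            // Characters that are not on the list are returned untouched.
            return Targets.Contains(c)
                ? "&#x" + ((int)c).ToString("X4") + ";"
                : m.Value;
        });
    }
}
```

With this list, SelectiveEncoder.Encode("\u00D9 \u00E9") converts the Ù to &#x00D9; and leaves the é as it is.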

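A comment on efficiency: you are trying to perform selective character replacement on a set of files. At a minimum, you have to read each file into memory and then examine every character to see whether it meets your criteria.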

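A more efficient option would be to build a lookup table of characters, each paired with its replacement string. If you have a large list of characters to replace, that table quickly becomes hard to maintain. You also run the risk of leaving an error in the replacement table, which would take more work to track down.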
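
For comparison, a lookup-table version might look like the sketch below: a single pass over the text with a Dictionary<char, string> and a StringBuilder, and no regex at all. TableEncoder, Table, and Encode are illustrative names, and the two entries stand in for the full table. Whether it is actually faster for your files is something to measure rather than assume.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TableEncoder
{
    // Illustrative table: each character maps to its hand-written replacement.
    private static readonly Dictionary<char, string> Table =
        new Dictionary<char, string>
        {
            { '\u00E9', "&#x00E9;" },
            { '\u0394', "&#x0394;" },
        };

    public static string Encode(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string rep;
            if (Table.TryGetValue(c, out rep))
            {
                sb.Append(rep);   // character is in the table: emit its NCR
            }
            else
            {
                sb.Append(c);     // everything else is copied through unchanged
            }
        }
        return sb.ToString();
    }
}
```

Every entry here is typed by hand, which is exactly where the maintenance risk mentioned above comes from; computing the code with ToString("X4") avoids that.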
