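Some of our users use e-mail clients that can't cope with Unicode, even when the encoding and so on are set correctly in the mail headers.
I'd like to "normalise" the content they receive. The biggest problem we have is users copying content from Microsoft Word into our web application, which then forwards that content by e-mail - including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.
I'm guessing there is no definitive solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that will get me started?
There are basically three phases involved.
First, stripping accents from otherwise normal letters. This part is already solved: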
This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions
becomes
This paragraph contains “smart quotes” and accents and ½ of the problem is fractions
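For reference, a minimal sketch of that accent-stripping step. It assumes .NET's string.Normalize with canonical decomposition (NFD); the class and method names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Hypothetical helper: decompose, then drop the combining marks.
    public static string RemoveAccents(string text)
    {
        // FormD splits "á" into "a" followed by U+0301 (combining acute accent).
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except non-spacing marks (Unicode category Mn).
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                result.Append(c);
        }
        return result.ToString();
    }
}
```

Smart quotes and ½ pass through unchanged, because canonical decomposition does not touch them.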
Second, replacing single Unicode characters with their ASCII equivalent, to give:
This paragraph contains "smart quotes" and accents and ½ of the problem is fractions
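This is the part where I'm hoping a solution exists before I implement my own. Finally, replacing specific characters with a suitable ASCII sequence - ½ to 1/2, and so on - which I'm pretty sure isn't natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use.
Any ideas?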
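Answer 0 (score: 17)
Thanks, everyone, for some very useful answers. I realise the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback" - the question is "How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks"?
In other words, we don't need a general-purpose solution; we need a solution that works 99% of the time for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analysed eight years' worth of messages sent through our system, looking for characters that can't be represented in ASCII encoding, using this test: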
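// Requires: using System.Text;
/// <summary>Determine whether the supplied character can be
/// represented using ASCII encoding.</summary>
bool IsAscii(char inputChar) {
    // ASCIIEncoding substitutes '?' for anything it cannot encode,
    // so only genuine ASCII characters survive the round trip.
    var ascii = new ASCIIEncoding();
    var asciiChar = (char)(ascii.GetBytes(inputChar.ToString())[0]);
    return (asciiChar == inputChar);
}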
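The survey code itself is not shown here. A minimal sketch of how such a tally could be produced, assuming the messages are available as a sequence of strings (the class and method names are hypothetical, and a plain code-point comparison stands in for the encoding round trip above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NonAsciiTally
{
    // Counts every non-ASCII character across the supplied messages
    // and returns the counts, most frequent first.
    public static List<KeyValuePair<char, int>> Count(IEnumerable<string> messages)
    {
        var counts = new Dictionary<char, int>();
        foreach (string message in messages)
        {
            foreach (char c in message)
            {
                if (c <= 127) continue; // ASCII covers U+0000..U+007F
                counts.TryGetValue(c, out int seen); // seen is 0 when c is new
                counts[c] = seen + 1;
            }
        }
        return counts.OrderByDescending(pair => pair.Value).ToList();
    }
}
```

Sorting by frequency makes it easy to assign replacements for the most common offenders first and to stop once the long tail no longer matters.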
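I then went through the resulting set of unrepresentable characters and manually assigned a suitable replacement string to each. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII approximation.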
using System;
using System.Collections.Generic;
using System.Linq;

public static class StringExtensions {
private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>();
/// <summary>Returns the specified string with characters not representable in ASCII converted to a suitable representative equivalent. Yes, this is lossy.</summary>
/// <param name="s">A string.</param>
/// <returns>The supplied string, with smart quotes, fractions, accents and punctuation marks 'normalized' to ASCII equivalents.</returns>
/// <remarks>This method is lossy. It's a bit of a hack that we use to get clean ASCII text for sending to downlevel e-mail clients.</remarks>
public static string Asciify(this string s) {
return (String.Join(String.Empty, s.Select(c => Asciify(c)).ToArray()));
}
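private static string Asciify(char x) {
// Fall back to the character itself when no replacement is registered.
string replacement;
return Replacements.TryGetValue(x, out replacement) ? replacement : x.ToString();
}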
static StringExtensions() {
Replacements['’'] = "'"; // 75151 occurrences
Replacements['–'] = "-"; // 23018 occurrences
Replacements['‘'] = "'"; // 9783 occurrences
Replacements['”'] = "\""; // 6938 occurrences
Replacements['“'] = "\""; // 6165 occurrences
Replacements['…'] = "..."; // 5547 occurrences
Replacements['£'] = "GBP"; // 3993 occurrences
Replacements['•'] = "*"; // 2371 occurrences
Replacements['\u00A0'] = " "; // 1529 occurrences; assumed to be U+00A0 NO-BREAK SPACE
Replacements['é'] = "e"; // 878 occurrences
Replacements['ï'] = "i"; // 328 occurrences
Replacements['´'] = "'"; // 226 occurrences
Replacements['—'] = "-"; // 133 occurrences
Replacements['·'] = "*"; // 132 occurrences
Replacements['„'] = "\""; // 102 occurrences
Replacements['€'] = "EUR"; // 95 occurrences
Replacements['®'] = "(R)"; // 91 occurrences
Replacements['¹'] = "(1)"; // 80 occurrences
Replacements['«'] = "\""; // 79 occurrences
Replacements['è'] = "e"; // 79 occurrences
Replacements['á'] = "a"; // 55 occurrences
Replacements['™'] = "TM"; // 54 occurrences
Replacements['»'] = "\""; // 52 occurrences
Replacements['ç'] = "c"; // 52 occurrences
Replacements['½'] = "1/2"; // 48 occurrences
Replacements['\u00AD'] = "-"; // 39 occurrences; assumed to be U+00AD SOFT HYPHEN (the character is invisible in the source)
Replacements['°'] = " degrees "; // 33 occurrences
Replacements['ä'] = "a"; // 33 occurrences
Replacements['É'] = "E"; // 31 occurrences
Replacements['‚'] = ","; // 31 occurrences
Replacements['ü'] = "u"; // 30 occurrences
Replacements['í'] = "i"; // 28 occurrences
Replacements['ë'] = "e"; // 26 occurrences
Replacements['ö'] = "o"; // 19 occurrences
Replacements['à'] = "a"; // 19 occurrences
Replacements['¬'] = " "; // 17 occurrences
Replacements['ó'] = "o"; // 15 occurrences
Replacements['â'] = "a"; // 13 occurrences
Replacements['ñ'] = "n"; // 13 occurrences
Replacements['ô'] = "o"; // 10 occurrences
Replacements['¨'] = ""; // 10 occurrences
Replacements['å'] = "a"; // 8 occurrences
Replacements['ã'] = "a"; // 8 occurrences
Replacements['ˆ'] = ""; // 8 occurrences
Replacements['©'] = "(c)"; // 6 occurrences
Replacements['Ä'] = "A"; // 6 occurrences
Replacements['Ï'] = "I"; // 5 occurrences
Replacements['ò'] = "o"; // 5 occurrences
Replacements['ê'] = "e"; // 5 occurrences
Replacements['î'] = "i"; // 5 occurrences
Replacements['Ü'] = "U"; // 5 occurrences
Replacements['Á'] = "A"; // 5 occurrences
Replacements['ß'] = "ss"; // 4 occurrences
Replacements['¾'] = "3/4"; // 4 occurrences
Replacements['È'] = "E"; // 4 occurrences
Replacements['¼'] = "1/4"; // 3 occurrences
Replacements['†'] = "+"; // 3 occurrences
Replacements['³'] = "'"; // 3 occurrences
Replacements['²'] = "'"; // 3 occurrences
Replacements['Ø'] = "O"; // 2 occurrences
Replacements['¸'] = ","; // 2 occurrences
Replacements['Ë'] = "E"; // 2 occurrences
Replacements['ú'] = "u"; // 2 occurrences
Replacements['Ö'] = "O"; // 2 occurrences
Replacements['û'] = "u"; // 2 occurrences
Replacements['Ú'] = "U"; // 2 occurrences
Replacements['Œ'] = "Oe"; // 2 occurrences
Replacements['º'] = "?"; // 1 occurrences
Replacements['‰'] = "0/00"; // 1 occurrences
Replacements['Å'] = "A"; // 1 occurrences
Replacements['ø'] = "o"; // 1 occurrences
Replacements['˜'] = "~"; // 1 occurrences
Replacements['æ'] = "ae"; // 1 occurrences
Replacements['ù'] = "u"; // 1 occurrences
Replacements['‹'] = "<"; // 1 occurrences
Replacements['±'] = "+/-"; // 1 occurrences
}
}
Note that there are some rather odd fallbacks in there - like this one:
Replacements['³'] = "'"; // 3 occurrences
Replacements['²'] = "'"; // 3 occurrences
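That's because one of our users has some program that converts opening and closing smart quotes into ² and ³ (like: he said ²hello³), and nobody has ever used them to represent exponentiation, so this will probably work quite nicely for us, but YMMV.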
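Answer 1 (score: 5)
I ran into this problem myself while working with a list of strings originally built in Word. I found that a simple "String".Replace(current char/string, new char/string) command works well. The exact code I used handles the smart quotes, or to be exact: left “, right ”, left ‘ and right ’, as follows: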
StringName = StringName.Replace(ChrW(8216), "'") ' Replaces any left ' with a normal '
StringName = StringName.Replace(ChrW(8217), "'") ' Replaces any right ' with a normal '
StringName = StringName.Replace(ChrW(8220), """") ' Replace any left " with a normal "
StringName = StringName.Replace(ChrW(8221), """") ' Replace any right " with a normal "
I hope this helps anyone who still has this problem!
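Since the rest of this thread is in C#, here is a rough C# equivalent of the VB.NET snippet above. The class and method names are hypothetical; the code points are the same four that the ChrW calls use (8216, 8217, 8220, 8221).

```csharp
using System;

public static class SmartQuotes
{
    // Replaces the four curly quotation marks with their straight ASCII forms.
    public static string Straighten(string text)
    {
        return text
            .Replace('\u2018', '\'') // left single quotation mark  (ChrW(8216))
            .Replace('\u2019', '\'') // right single quotation mark (ChrW(8217))
            .Replace('\u201C', '"')  // left double quotation mark  (ChrW(8220))
            .Replace('\u201D', '"'); // right double quotation mark (ChrW(8221))
    }
}
```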
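Answer 2 (score: 1)
"is there some built-in method that will get me started?"
The first thing I would try is converting the text to NFKD normalization form using the string's Normalize method. That suggestion appears in the answer to the question you linked, but I would recommend NFKD rather than NFD, because NFKD also removes unwanted typographical distinctions (for example, NBSP → space, or ℂ → C).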
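A minimal sketch of that first step, assuming .NET's string.Normalize with NormalizationForm.FormKD. The class and method names are hypothetical; RemainingNonAscii is only there to show what a lookup table would still have to cover afterwards.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CompatibilityFolding
{
    // Applies compatibility decomposition (NFKD) to the text.
    public static string Fold(string text) =>
        text.Normalize(NormalizationForm.FormKD);

    // Lists the distinct characters that are still outside ASCII after folding.
    public static char[] RemainingNonAscii(string text) =>
        Fold(text).Where(c => c > 127).Distinct().ToArray();
}
```

NFKD on its own does not finish the job: ½ decomposes to 1, U+2044 FRACTION SLASH, 2, so the fraction slash still needs mapping to "/", and smart quotes have no decomposition at all, so they pass through unchanged.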
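You could also do generic replacements by Unicode category. For example, Pd can be replaced with -, Nd can be replaced with the corresponding 0-9 digit, and Mn can be replaced with the empty string (to remove accents).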
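The category-based replacements could be sketched like this. It is only an illustration with hypothetical names; it assumes the text is normalized to NFKD first (so that accents exist as separate Mn characters) and uses .NET's CharUnicodeInfo to look up categories and digit values.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CategoryFolding
{
    public static string Fold(string text)
    {
        var result = new StringBuilder(text.Length);
        foreach (char c in text.Normalize(NormalizationForm.FormKD))
        {
            // ASCII characters are already fine.
            if (c <= 127) { result.Append(c); continue; }

            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.DashPunctuation:    // Pd: any kind of dash
                    result.Append('-');
                    break;
                case UnicodeCategory.DecimalDigitNumber: // Nd: digits in other scripts
                    result.Append((char)('0' + CharUnicodeInfo.GetDecimalDigitValue(c)));
                    break;
                case UnicodeCategory.NonSpacingMark:     // Mn: combining accents, dropped
                    break;
                default:                                 // left for a lookup table
                    result.Append(c);
                    break;
            }
        }
        return result.ToString();
    }
}
```

Anything that falls through to the default case (smart quotes, currency signs and so on) still needs an explicit table.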
"but somebody might have written a suitable lookup table I can re-use."
You could try using the data from the Unidecode program, or from CLDR.
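Answer 3 (score: 0)
You should never try to convert Unicode to ASCII, because you will end up with more problems than you solve.
It's like trying to fit 1,114,112 code points (Unicode 6.0) into 128 characters.
Do you think you will succeed?
BTW, there are a lot of quotation marks in Unicode, not only the ones you mentioned. And if you do want to do the conversion anyway, keep in mind that conversions depend on the locale. Check ICU - it contains the most complete set of Unicode conversion routines.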