Question

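I have a string like this (string inputstring = "xyz's &#123456 , 外部広告掲載費用 how are you?'"). I want to remove the special characters from this string using a regular expression in C#. I need output like this (xyzs 123456 外部広告掲載費用 how are you). Please let me know if this is possible.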

Answer 1

Something like:

string finalstring = Regex.Replace(inputstring, @"[^\p{L}\p{N}\s]", "");

The Unicode categories are listed here: https://msdn.microsoft.com/library/20bw873z.aspx

\p{L} are Letters
\p{N} are Numbers
\s are space characters

The character class is negated, so every character that does not belong to any of these three categories is removed.
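As a rough illustration, the pattern above can be run against the string from the question. This is only a sketch: the class and method names (CategoryFilterDemo, RemoveSpecialCharacters, CollapseWhitespace) are made up for the example, and the whitespace-collapsing step is an extra assumption added here to match the single spaces in the asker's desired output. It is not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class CategoryFilterDemo
{
    // Removes every character that is not a letter (\p{L}),
    // a number (\p{N}) or whitespace (\s).
    public static string RemoveSpecialCharacters(string input)
    {
        return Regex.Replace(input, @"[^\p{L}\p{N}\s]", "");
    }

    // Assumed extra step: collapse the runs of whitespace left behind
    // where punctuation was removed, and trim both ends.
    public static string CollapseWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string inputstring = "xyz's &#123456 , 外部広告掲載費用 how are you?'";

        // cleaned == "xyzs 123456  外部広告掲載費用 how are you" (two spaces after 123456)
        string cleaned = RemoveSpecialCharacters(inputstring);
        Console.WriteLine(cleaned);

        // collapsed == "xyzs 123456 外部広告掲載費用 how are you"
        string collapsed = CollapseWhitespace(cleaned);
        Console.WriteLine(collapsed);
    }
}
```

The kanji survive because they fall under \p{L} (letters), and the digits survive because they fall under \p{N}.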

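Note that, strictly speaking, this pattern is over-permissive: it keeps letters from every script rather than removing them, so in a mixed English-Chinese-Japanese-Arabic string the Chinese and Arabic characters stay in place. Removing the "Arabic" characters is easy, but removing the "Chinese" characters can be complicated, because Chinese and Japanese share the CJK Unified Ideographs...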

You could start with something like:

string finalstring = Regex.Replace(inputstring, @"[^\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}\p{IsLatinExtended-B}\p{IsLatinExtendedAdditional}\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}\s]", "");

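Then check whether you need to add other CJK blocks... (same page, "Supported named blocks" section). This removes "Arabic" (and other scripts), but obviously does nothing about the CJK "problem".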
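One thing to keep in mind with the block-based pattern: IsBasicLatin covers U+0000 to U+007F, which includes ASCII punctuation such as ', & and #, so on its own it would leave those characters in the string. A possible way to get both effects is to combine the two negated classes with an alternation, so that a character is removed if it fails either test. The sketch below is an assumption about how the two patterns might be combined, not something the answer itself proposes, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class BlockFilterDemo
{
    // First alternative: anything that is not a letter, number or whitespace.
    // Second alternative: anything outside the listed Latin/Japanese blocks
    // (whitespace is kept).
    private const string Pattern =
        @"[^\p{L}\p{N}\s]" +
        @"|[^\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}" +
        @"\p{IsLatinExtended-B}\p{IsLatinExtendedAdditional}" +
        @"\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}\s]";

    public static string KeepLatinAndJapanese(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }

    public static void Main()
    {
        // Arabic letters fall outside the listed blocks; & and # are not
        // letters or numbers. Both groups are removed.
        string mixed = "abc مرحبا 外部 &#1";

        // result == "abc  外部 1" (two spaces where the Arabic word was)
        string result = KeepLatinAndJapanese(mixed);
        Console.WriteLine(result);
    }
}
```

As the answer notes, Chinese text written with characters from the CJK Unified Ideographs block would still pass through this filter.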

Answer 2

You can create a char array of banned characters and use two for loops:

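string inputstring = "xyz's &#123456 , 外部広告掲載費用 how are you?'";
string outputstring = "";
// Characters to strip from the input.
char[] bannedCharacters = new char[] { '\'', '&', '#', ',', '?' };
bool isOk;

for (int i = 0; i < inputstring.Length; i++) {
    isOk = true;
    // Compare the current character against every banned character.
    for (int j = 0; j < bannedCharacters.Length; j++) {
        if (inputstring[i] == bannedCharacters[j]) {
            isOk = false;
            break; // already banned, no need to check the rest
        }
    }
    // Keep the character only if it is not banned.
    if (isOk) {
        outputstring += inputstring[i];
    }
}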
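The same idea can be written a little more compactly. The sketch below swaps the inner loop for Array.IndexOf and builds the result with a StringBuilder, which avoids allocating a new string on every append. The class and method names are hypothetical, and this is a variation on the approach above rather than the answerer's own code.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class BannedCharFilterDemo
{
    // Returns a copy of input without any character listed in banned.
    public static string RemoveBanned(string input, char[] banned)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Array.IndexOf returns -1 when c is not in the banned list.
            if (Array.IndexOf(banned, c) < 0)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        string inputstring = "xyz's &#123456 , 外部広告掲載費用 how are you?'";
        char[] bannedCharacters = { '\'', '&', '#', ',', '?' };

        // outputstring == "xyzs 123456  外部広告掲載費用 how are you"
        string outputstring = RemoveBanned(inputstring, bannedCharacters);
        Console.WriteLine(outputstring);
    }
}
```

Unlike the regex approach, a ban list only removes the characters it names, so any other punctuation in the input is kept.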

Removing special characters from a Japanese string using regular expressions in C#
