How to use Unicode in Regex

Asked: 2016-06-23 09:17:41

Tags: asp.net regex unicode

I am writing a regular expression to find the lines in a text file that contain certain Unicode characters:

!Regex.IsMatch(colCount.line, @"^[\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]+$")

Here is the complete code I wrote:

var _fileName = @"C:\text.txt";

BadLinesLst = File
    .ReadLines(_fileName, Encoding.UTF8)
    .Select((line, index) =>
    {
        var count = line.Count(c => Delimiter == c) + 1;
        if (NumberOfColumns < 0)
            NumberOfColumns = count;

        return new
        {
            line = line,
            count = count,
            index = index
        };
    })
    .Where(colCount => colCount.count != NumberOfColumns || (Regex.IsMatch(colCount.line, @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]")))
    .Select(colCount => colCount.line).ToList();

The file contains the following lines:

264162-03,66,JITK,2007,12,874.000,0.000,0.000

6420œ50-00,67,JITK,2007,12,2292.000,0.000,0.000

4804¥75-00,67,JITK,2007,12,1810.000,0.000,0.000

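If a line in the file contains any character outside BasicLatin, LatinExtended-A, or LatinExtended-B, I need to get that line. The regex above does not work correctly: it returns the lines that contain LatinExtended-A or LatinExtended-B characters.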

1 answer:

Answer 0 (score: 1)

You need to put the Unicode block classes inside a negated character class:

if (Regex.IsMatch(colCount.line, 
         @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]")) 
{ /* Do sth here */ }

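This regex finds a partial match, because Regex.IsMatch looks for the pattern anywhere inside a larger string. The pattern matches any single character that does not belong to the \p{IsBasicLatin}, \p{IsLatinExtended-A}, or \p{IsLatinExtended-B} Unicode blocks, so one such character anywhere in the line is enough for IsMatch to return true.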
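As a rough illustration, the negated class can be tried against the three sample lines from the question. This is only a sketch: the class name LatinBlockCheck and the method HasCharOutsideLatinBlocks are hypothetical names introduced here, and the non-ASCII characters are written as \u escapes on the assumption that œ is U+0153 (Latin Extended-A) and ¥ is U+00A5 (Latin-1 Supplement).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original code.
public static class LatinBlockCheck
{
    // Matches any single character outside the three named Unicode blocks.
    private static readonly Regex OutsideLatinBlocks = new Regex(
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]");

    // True if the line contains at least one character outside the blocks.
    public static bool HasCharOutsideLatinBlocks(string line)
    {
        return OutsideLatinBlocks.IsMatch(line);
    }
}
```

Under those assumptions, the call returns false for "264162-03,66,JITK,2007,12,874.000,0.000,0.000" (ASCII only), false for the line containing \u0153 (it falls inside Latin Extended-A), and true for the line containing \u00A5 (it falls in none of the three blocks).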

You may also want to look at the following code:

if (Regex.IsMatch(colCount.line, 
     @"^[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]*$")) 
{ /* Do sth here */ }

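This returns true if the entire colCount.line string contains no characters from the three Unicode blocks listed in the negated character class, or if the string is empty (to reject empty strings, replace the * at the end with +).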
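The difference between the * and + quantifiers in the anchored pattern can be sketched as follows. The names WholeLineCheck and OnlyOutsideChars, and the allowEmpty parameter, are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original code.
public static class WholeLineCheck
{
    // One character outside the three named Unicode blocks.
    private const string Outside =
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]";

    // '*' variant: the empty string also matches.
    private static readonly Regex AllowEmpty = new Regex("^" + Outside + "*$");

    // '+' variant: at least one character is required.
    private static readonly Regex RequireOne = new Regex("^" + Outside + "+$");

    // True if every character of the line lies outside the three blocks.
    public static bool OnlyOutsideChars(string line, bool allowEmpty)
    {
        return (allowEmpty ? AllowEmpty : RequireOne).IsMatch(line);
    }
}
```

With this sketch, an empty string passes only when allowEmpty is true, while a line such as "4804\u00A575-00" fails either way because its digits belong to Basic Latin. The anchored form therefore answers a different question from the unanchored one: it asks whether the whole line is made up of characters outside the blocks, not whether the line contains one.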