I am writing a regular expression to find the lines in a text file that contain particular Unicode characters
!Regex.IsMatch(colCount.line, @"^[\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]+$")
Below is the complete code I wrote
var _fileName = @"C:\text.txt";
BadLinesLst = File
    .ReadLines(_fileName, Encoding.UTF8)
    .Select((line, index) =>
    {
        var count = line.Count(c => Delimiter == c) + 1;
        if (NumberOfColumns < 0)
            NumberOfColumns = count;
        return new
        {
            line = line,
            count = count,
            index = index
        };
    })
    .Where(colCount => colCount.count != NumberOfColumns || (Regex.IsMatch(colCount.line, @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]")))
    .Select(colCount => colCount.line).ToList();
The file contains the following lines
264162-03,66,JITK,2007,12,874.000,0.000,0.000
6420œ50-00,67,JITK,2007,12,2292.000,0.000,0.000
4804¥75-00,67,JITK,2007,12,1810.000,0.000,0.000
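If a line in the file contains any character outside BasicLatin, LatinExtended-A, or LatinExtended-B, I need to retrieve that line. The regex above does not work correctly: it also returns lines that contain LatinExtended-A or LatinExtended-B characters.
Answer 0 (score: 1)
You need to put the Unicode block classes into a negated character class: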
if (Regex.IsMatch(colCount.line,
    @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]"))
{ /* Do something here */ }
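This regex finds a partial match (because Regex.IsMatch looks for a pattern match anywhere inside a larger string). The pattern matches any character that lies outside the \p{IsBasicLatin}, \p{IsLatinExtended-A}, and \p{IsLatinExtended-B} Unicode blocks.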
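To make the partial-match behaviour concrete, here is a minimal sketch that runs the negated class over the three sample lines from the question. The class name BlockCheckDemo and the helper HasDisallowedChar are hypothetical names introduced for illustration only. Non-ASCII characters are written as \u escapes so the result does not depend on how the source file is encoded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockCheckDemo
{
    // Matches any single character that is NOT in one of the three allowed blocks.
    private static readonly Regex OutsideAllowedBlocks = new Regex(
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]");

    // Hypothetical helper: true when the line has at least one disallowed character.
    public static bool HasDisallowedChar(string line) =>
        OutsideAllowedBlocks.IsMatch(line);

    public static void Main()
    {
        string[] lines =
        {
            "264162-03,66,JITK,2007,12,874.000,0.000,0.000",
            // U+0153 (oe ligature) belongs to Latin Extended-A, so it is allowed.
            "6420" + "\u0153" + "50-00,67,JITK,2007,12,2292.000,0.000,0.000",
            // U+00A5 (yen sign) belongs to Latin-1 Supplement, so it is flagged.
            "4804" + "\u00A5" + "75-00,67,JITK,2007,12,1810.000,0.000,0.000"
        };

        for (int i = 0; i < lines.Length; i++)
            Console.WriteLine($"line {i + 1}: {HasDisallowedChar(lines[i])}");
        // Prints False, False, True for lines 1, 2 and 3.
    }
}
```

Only the third line is reported, because the yen sign is in the Latin-1 Supplement block, which the pattern does not list. If such characters should be accepted as well, adding \p{IsLatin-1Supplement} to the class is one option.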
You may also want to look at the following code:
if (Regex.IsMatch(colCount.line,
    @"^[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]*$"))
{ /* Do something here */ }
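This returns true if the whole colCount.line string contains no characters from the three Unicode blocks listed in the negated character class, or if the string is empty (to reject empty strings, replace the * at the end with +).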
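The difference between the * and + quantifiers in the anchored pattern can be checked with a short sketch. The class AnchoredCheckDemo, its two helper methods, and the sample strings are assumptions made for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredCheckDemo
{
    // One character outside the three blocks named in the answer.
    private const string Outside =
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]";

    // Whole string consists only of "outside" characters, or is empty.
    public static bool AllOutsideOrEmpty(string s) =>
        Regex.IsMatch(s, "^" + Outside + "*$");

    // Same check, but the empty string is rejected.
    public static bool AllOutsideNonEmpty(string s) =>
        Regex.IsMatch(s, "^" + Outside + "+$");

    public static void Main()
    {
        string[] samples =
        {
            "",              // star: True,  plus: False
            "\u00A5\u20AC",  // yen + euro, both outside: True, True
            "abc",           // Basic Latin only: False, False
            "a\u00A5"        // mixed: False, False
        };

        foreach (var s in samples)
            Console.WriteLine(
                $"length {s.Length}: star={AllOutsideOrEmpty(s)}, plus={AllOutsideNonEmpty(s)}");
    }
}
```

Because the pattern is anchored at both ends, a single Basic Latin character such as a comma or digit makes the whole match fail, so for the comma-separated lines in the question the unanchored form is probably the more useful one.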