Reading each non-English character from a file

Date: 2016-05-12 04:00:31

Tags: c# string encoding character non-english

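Let's say a file contains non-English text. We can read its contents with the FileIO.ReadLinesAsync method. Each line then contains a group of characters. How can I extract each letter (of a non-English alphabet) from such a string? The C# code below illustrates my problem.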

    List<string> finalAlphabets = new List<string>();
    IList<string> alphabetLines = await FileIO.ReadLinesAsync(_languageFile, UnicodeEncoding.Utf8);
    if (alphabetLines.Count != 0)
    {
        foreach (string alphabetLine in alphabetLines)
        {
            // Let's say alphabetLine is "కాకికు". I want to extract each letter from it
            // and add it to the finalAlphabets list.
            // How do I extract this letter from the alphabetLine variable?
            // alphabetLine.Length is 6, but in Telugu this is a 3-letter word.
            finalAlphabets.Add("కా");
        }
    }

1 answer:

Answer 0 (score: 0)

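There is a set of text-information classes, TextInfo and StringInfo, and in particular you are probably looking for TextElementEnumerator, which lets you find "text element" boundaries. String.Length counts UTF-16 code units rather than text elements, which is why it reports 6 for a word that a reader sees as 3 letters.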

A simplified sample from the MSDN article:

var myTEE = System.Globalization.StringInfo.GetTextElementEnumerator("కాకికు");
while (myTEE.MoveNext())
{
    Console.WriteLine("[{0}]:\t{1}\t{2}",
        myTEE.ElementIndex, myTEE.Current, myTEE.GetTextElement());
}

This produces the following output:

[0]:  కా  కా
[2]:  కి  కి
[4]:  కు  కు
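
To connect this back to the question, here is a minimal sketch of how the enumerator could be used to fill the finalAlphabets list. The class TextElementSplitter and the method SplitIntoTextElements are hypothetical names introduced only for illustration, and the hard-coded list stands in for the lines returned by FileIO.ReadLinesAsync so that the sample runs as a plain console program.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSplitter
{
    // Hypothetical helper (not from the question or the answer):
    // returns the text elements of a line, each being a base character
    // together with any combining marks attached to it.
    public static List<string> SplitIntoTextElements(string line)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(line);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        var finalAlphabets = new List<string>();

        // Assumption: stand-in for the result of FileIO.ReadLinesAsync.
        var alphabetLines = new List<string> { "కాకికు" };

        foreach (string alphabetLine in alphabetLines)
        {
            finalAlphabets.AddRange(SplitIntoTextElements(alphabetLine));
        }

        // Print the number of text elements, then the elements themselves.
        Console.WriteLine(finalAlphabets.Count);
        Console.WriteLine(string.Join(" | ", finalAlphabets));
    }
}
```

For the sample word, the list should end up with the same three elements shown in the output above. Text-element boundaries are not guaranteed to match what a reader of a given script considers a letter in every case (conjuncts formed with a virama, for example, may be split differently depending on the runtime version), so it is worth checking the result against real data from the file.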