Question

I have a mixed Hebrew/English string to parse. The string is built like this:

[3 hebrew] [2 english 2] [1 hebrew],

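So it is read as 1 2 3 but stored as 3 2 1 (that is the exact byte sequence in the file, double-checked in a hex editor; RTL is only a display attribute anyway). The .NET regex parser has an RTL option which, when given plain LTR text, starts processing from the right side of the string.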

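I am wondering when this option should be applied: to extract the [3 hebrew] and [2 english] parts from the string, or to check whether [1 hebrew] matches the end of the string? Are there any hidden specifics, or is there nothing to worry about (as when processing any LTR string with special Unicode characters)?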

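Also, can anyone recommend a good RTL + LTR text editor? (I am afraid that VS Express sometimes displays the text wrong, and that it might even start messing up the saved strings. I would like to re-check the files without resorting to a hex editor anymore.)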

Answer 1

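The RightToLeft option refers to the order in which the regular expression walks through the character sequence, and it should really be called LastToFirst: for Hebrew and Arabic text, last-to-first is actually left to right on screen, and with mixed RTL and LTR text such as you describe, the phrase "right to left" is even less appropriate.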

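This has a minor effect on speed (it only matters if the searched text is massive) and on regular expressions run with a startAt index (the search covers the part of the string before startAt rather than the part after it).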

An example; let's hope browsers don't mess this up too much:

using System;
using System.Text.RegularExpressions;

string saying = "למכות is in כתר"; // Just because it amuses me that this is a saying whichever way round the browser puts malkuth and kether.
string kether = "כתר";
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying)); // True
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying)); // True, perhaps minutely faster, but by so little that noise would hide it.
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying, 2)); // False: only the text before index 2 is searched
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying, 2)); // True: the text from index 2 onward is searched
// And to show that the ordering is code-point (storage) order rather than physical display order:
Console.WriteLine(new Regex("" + kether[0] + ".*" + kether[2]).IsMatch(saying)); // True
Console.WriteLine(new Regex("" + kether[2] + ".*" + kether[0]).IsMatch(saying)); // False
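For the three-part layout in the question, a minimal sketch along the same lines might look like the following. It reuses the sample string above in place of the asker's real data; the patterns and the group names head, latin and tail are assumptions made for illustration, and it is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample string reused from the example above; the stored (logical) order is
// [Hebrew run] [Latin words] [Hebrew run].
string stored = "למכות is in כתר";

// \p{IsHebrew} matches characters in the Unicode Hebrew block (U+0590..U+05FF).
const string hebrewRun = @"\p{IsHebrew}+";

// Default direction: the first match found is the run stored first.
Match first = Regex.Match(stored, hebrewRun);
// RightToLeft: the scan starts at the end, so the first match found is the run stored last.
Match last = Regex.Match(stored, hebrewRun, RegexOptions.RightToLeft);
Console.WriteLine($"{first.Index},{first.Length}"); // 0,5
Console.WriteLine($"{last.Index},{last.Length}");   // 12,3

// Checking the end of the string needs only an anchor, not RightToLeft.
Console.WriteLine(Regex.IsMatch(stored, hebrewRun + "$")); // True

// Hypothetical pattern that splits all three parts in one pass, in storage order.
Match parts = Regex.Match(stored,
    @"^(?<head>\p{IsHebrew}+)\s+(?<latin>[A-Za-z ]+?)\s+(?<tail>\p{IsHebrew}+)$");
Console.WriteLine(parts.Groups["latin"].Value); // is in
```

Whichever direction the scan runs, Match.Value is a substring of the input in storage order, so the option changes which match is found first, not how the matched text reads.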

How does the .NET regex engine handle mixed RTL + LTR strings?
