Matching non-alphanumeric characters excluding diacritics in EmEditor

Time: 2016-08-31 17:09:46

Tags: regex match combinations emeditor

I am trying to match any non-alphanumeric character with a Unicode-aware regex pattern. I tried combining [\u00D8-\u00F6] and [^\w'’-], but to no avail.

I have this: right ស្ដាំ sdam. When I write [^\w'’-] in the Find and Replace dialog, it matches the non-alphanumeric characters and also parts of the non-English word (ាំ and ្). I don't want to match those diacritics.

When I write [\u00D8-\u00F6], it does not match English characters, but it does match some non-English characters and the diacritics that decorate words, such as ាំ and ្.

1 answer:

Answer 0 (score: 0)

You cannot rely on the default Boost.Regex engine; it seems to be poorly implemented in EmEditor.

Go to Advanced and change the regular expression engine to Onigmo.

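Then use [^\p{L}\p{M}\p{N}] (or [^\p{L}\p{M}\p{N}'’-]+ to match whole runs at once while excluding ', ’ and -, which may be part of a word), or any other regex you are using: the Unicode category classes will start working.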

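Note that \w is not Unicode-aware, so you need to use \p{L}\p{M}\p{N}:

\p{L} - any Unicode letter from the BMP (Basic Multilingual Plane)
\p{M} - any diacritic (combining mark)
\p{N} - any Unicode number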

More properties can be found in the UnicodeProps.txt file.
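
If you want to check outside EmEditor which characters a class like [^\p{L}\p{M}\p{N}'’-]+ should leave over, here is a minimal Python sketch. It does not use Onigmo; it emulates the three categories with the standard-library unicodedata module, on the assumption that its general categories line up with \p{L}, \p{M} and \p{N}. The names WORD_PUNCT, is_word_char and non_word_runs are made up for this illustration.

```python
import unicodedata
from typing import List

# Characters the pattern also treats as part of a word: ', ’ and -
WORD_PUNCT = {"'", "\u2019", "-"}


def is_word_char(ch: str) -> bool:
    """True if ch is a letter (L*), a mark (M*), a number (N*) or word punctuation."""
    return unicodedata.category(ch)[0] in "LMN" or ch in WORD_PUNCT


def non_word_runs(text: str) -> List[str]:
    """Collect maximal runs of characters that are NOT word characters,
    roughly what [^\\p{L}\\p{M}\\p{N}'’-]+ would match."""
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_word_char(ch):
            # A word character ends the current run, if there is one.
            if current:
                runs.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    sample = "right ស្ដាំ sdam."
    # The Khmer combining marks belong to the M* categories, so they are
    # kept with the word instead of being reported as non-word characters.
    print(non_word_runs(sample))
```

Because the combining marks are classified as M*, only the spaces and the final period are reported for the sample from the question, which is the behavior asked for.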