Question

How can I match whole words, rather than single letters, in a culture-independent way?

\w matches letters or digits, but I want to ignore digits. So using \w\s on "111 or this" will not work.

I want to get just "or this". I guess {^[A-Za-z]+$} is not the solution, because the German alphabet has some additional letters.

Answer 1

This should work for matching words:

\b[^\d\s]+\b

Breakdown:

\b  -  word boundary
[   -  start of character class
^   -  negation within character class
\d  -  numerals
\s  -  whitespace
]   -  end of character class
+   -  repeat previous character one or more times
\b  -  word boundary

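This matches any run of characters delimited by word boundaries that contains no digits and no whitespace (so "words" such as "aa?aa!aa" will also match).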
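To make that behaviour concrete, here is a minimal sketch that runs the pattern above against the string from the question and against the punctuation example. The class name `NonDigitWords` and the helper `Find` are made up for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NonDigitWords
{
    // Pattern from the answer: runs of non-digit, non-whitespace characters
    // that start and end at a word boundary.
    private static readonly Regex Pattern = new Regex(@"\b[^\d\s]+\b");

    // Hypothetical helper: returns every match as a string.
    public static string[] Find(string input) =>
        Pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Find("111 or this"))); // or|this
        Console.WriteLine(string.Join("|", Find("aa?aa!aa")));    // aa?aa!aa
    }
}
```

The second call shows the caveat: punctuation inside a run is accepted, because the character class only rules out digits and whitespace.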

Alternatively, if you want to exclude those as well, you can use:

\b[\p{L}\p{M}]+\b

Breakdown:

\b    -  word boundary
[     -  start of character class
\p{L} -  single code point in the category "letter"
\p{M} -  code point that is a combining mark (such as diacritics)
]     -  end of character class
+     -  repeat previous character one or more times
\b    -  word boundary
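As a rough check of the letters-and-marks pattern, the sketch below applies it to the question's input, to the punctuation example, and to a German sample sentence. The sample sentence, the class name `LetterWords`, and the helper `Find` are assumptions for illustration. Non-ASCII letters are written as `\u` escapes so the result does not depend on the source file's encoding.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterWords
{
    // Pattern from the answer: runs of Unicode letters and combining marks
    // that start and end at a word boundary.
    private static readonly Regex Pattern = new Regex(@"\b[\p{L}\p{M}]+\b");

    // Hypothetical helper: returns every match as a string.
    public static string[] Find(string input) =>
        Pattern.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Find("111 or this"))); // or|this
        Console.WriteLine(string.Join("|", Find("aa?aa!aa")));    // aa|aa|aa

        // "Größe über 42 Straßen": the three words are found, "42" is skipped.
        string german = "Gr\u00F6\u00DFe \u00FCber 42 Stra\u00DFen";
        Console.WriteLine(Find(german).Length);                    // 3
    }
}
```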

Answer 2

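Use this expression: \b[\p{L}\p{M}]+\b. It uses a less well-known notation for matching Unicode characters (code points) of a specified category. So \p{L} matches any letter and \p{M} matches any combining mark. The latter is needed because an accented character is sometimes encoded as two code points (the letter itself plus a combining mark), and \p{L} alone would match only one of them in that case.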
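The two-code-point case can be illustrated with "über" written once with the precomposed U+00FC and once as "u" followed by U+0308 (combining diaeresis). This is only a sketch; the class name `CombiningMarks` and the helper `FirstMatch` are invented for the example, and the word boundaries are left out so that the effect of \p{M} is visible in isolation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombiningMarks
{
    // Hypothetical helper: returns the first match of the pattern,
    // or an empty string if nothing matches.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        string composed = "\u00FCber";    // U+00FC followed by "ber" (4 chars)
        string decomposed = "u\u0308ber"; // "u" + U+0308 followed by "ber" (5 chars)

        // Letters only: fine for the precomposed form...
        Console.WriteLine(FirstMatch(composed, @"\p{L}+").Length);          // 4
        // ...but the match stops in front of the combining mark.
        Console.WriteLine(FirstMatch(decomposed, @"\p{L}+").Length);        // 1
        // Letters plus marks: the whole word is matched.
        Console.WriteLine(FirstMatch(decomposed, @"[\p{L}\p{M}]+").Length); // 5
    }
}
```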

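Also note that this is a general expression for matching words that may contain international characters. If you need, for example, to match several words at once, or to allow words that end in digits, you will have to modify the pattern accordingly.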
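One possible way to make those two modifications is sketched below. Both variant patterns, the class name `PatternVariants`, and the helper `Find` are assumptions of this example and are not part of the answer; other formulations would work as well.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternVariants
{
    // Assumed variant 1: one or more words separated by whitespace, as a single match.
    public const string SeveralWords = @"\b[\p{L}\p{M}]+(?:\s+[\p{L}\p{M}]+)*\b";

    // Assumed variant 2: a word that may end in a run of digits.
    public const string TrailingDigits = @"\b[\p{L}\p{M}]+\d*\b";

    // Hypothetical helper: returns every match as a string.
    public static string[] Find(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // The question's input yields the single match "or this".
        Console.WriteLine(string.Join("|", Find(SeveralWords, "111 or this"))); // or this
        Console.WriteLine(string.Join("|", Find(TrailingDigits, "abc123 x")));  // abc123|x
    }
}
```

With the first variant, the question's input gives exactly the "or this" result that was asked for.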

Answer 3

I would suggest using this:

foundMatch = Regex.IsMatch(SubjectString, @"\b[\p{L}\p{M}]+\b");

to match only Unicode letters (and combining marks).

While @Oded's answer may also work, it also matches this: p+ü+üü++üüü++ü, which is not a word.

Explanation

\b              # Assert position at a word boundary
[\p{L}\p{M}]    # Match a single character from the list below:
                #   a character with the Unicode property "letter" (any kind of letter from any language)
                #   a character with the Unicode property "mark" (a character intended to be combined with another character, e.g. accents, umlauts, enclosing boxes)
+               # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b              # Assert position at a word boundary

Answer 4

I think the regex would be [^\d\s]+, i.e. anything that is not a digit or a whitespace character.
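For comparison, here is a small sketch of how this pattern behaves with and without the surrounding \b used in the first answer. The test string, the class name `BoundaryComparison`, and the helper `Find` are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryComparison
{
    // Hypothetical helper: returns every match as a string.
    public static string[] Find(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        string input = "111 or this ?!";

        // Without word boundaries, a run of pure punctuation also matches.
        Console.WriteLine(string.Join("|", Find(@"[^\d\s]+", input)));     // or|this|?!

        // With word boundaries, the punctuation-only run is dropped.
        Console.WriteLine(string.Join("|", Find(@"\b[^\d\s]+\b", input))); // or|this
    }
}
```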
