Question

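I'm writing a regular expression in Go to capture hashtags that may be written in different languages. English is the obvious case, but Latin or Arabic users may try to create hashtags using those character sets. I know the names of the Unicode character classes, but how can I use several of them at once without building a separate regex for each one?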

Example code:

r, err := regexp.Compile(`\B(\#[[:ascii:]]+\b)[^?!;]*`)

This matches "#hello #ذوق" and outputs []string{#hello, #ذوق}, but it does not match "#ذوق" by itself.

Answer 1

I suggest using

\B#[\p{L}\p{N}\p{M}_]+

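where [\p{L}\p{N}\p{M}_] is roughly a Unicode-aware version of the \w pattern. \p{L} matches any Unicode letter, \p{M} matches any combining mark, and \p{N} matches any Unicode numeric character (decimal digits in any script, plus other number characters).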
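To make the three classes concrete, here is a minimal sketch that reports which class a single character falls into. The classify helper and the sample characters are illustrative assumptions, not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
)

// Each pattern is anchored so it must match exactly one whole character.
var (
	letterRe = regexp.MustCompile(`^\p{L}$`)
	numberRe = regexp.MustCompile(`^\p{N}$`)
	markRe   = regexp.MustCompile(`^\p{M}$`)
)

// classify is a hypothetical helper that names the part of
// [\p{L}\p{N}\p{M}_] a rune belongs to, or "other" if it belongs to none.
func classify(r rune) string {
	s := string(r)
	switch {
	case letterRe.MatchString(s):
		return "L"
	case numberRe.MatchString(s):
		return "N"
	case markRe.MatchString(s):
		return "M"
	case r == '_':
		return "underscore"
	default:
		return "other"
	}
}

func main() {
	samples := []rune{
		'h',      // Latin letter
		'\u0630', // Arabic letter thal
		'\u00EB', // Latin small letter e with diaeresis
		'\u0663', // Arabic-Indic digit three
		'\u0301', // combining acute accent
		'_',
		'#',
	}
	for _, r := range samples {
		fmt.Printf("%U %s\n", r, classify(r))
	}
}
```

Anything reported as "other", such as '#' or whitespace, ends a hashtag match.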

See the Go demo:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    text := "#hello #ذوق #citroën"
    r := regexp.MustCompile(`\B#[\p{L}\p{N}\p{M}_]+`)
    res := r.FindAllString(text, -1)
    for _, element := range res {
        fmt.Println(element)
    }
}

Output:

#hello
#ذوق
#citroën

With text := "#ذوق", the output is #ذوق.

See the regex demo.
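If only particular scripts should be accepted, the same bracket expression can combine script classes instead of the general \p{L}. The following is a sketch under that assumption: the scriptTagRe pattern and the tagNames helper are hypothetical names, and the choice of \p{Latin} and \p{Arabic} simply mirrors the languages mentioned in the question. A capture group is added so the tag name can be read without the leading #.

```go
package main

import (
	"fmt"
	"regexp"
)

// scriptTagRe is a narrower variant of the suggested pattern: several
// Unicode classes are listed inside one bracket expression, so a single
// regex covers Latin and Arabic letters, numbers, marks, and underscores.
var scriptTagRe = regexp.MustCompile(`\B#([\p{Latin}\p{Arabic}\p{N}\p{M}_]+)`)

// tagNames returns the hashtag names found in text, without the '#'.
func tagNames(text string) []string {
	var names []string
	for _, m := range scriptTagRe.FindAllStringSubmatch(text, -1) {
		names = append(names, m[1]) // m[1] is the first capture group
	}
	return names
}

func main() {
	// The Cyrillic tag is skipped because that script is not listed.
	for _, name := range tagNames("#hello #ذوق #citroën #привет") {
		fmt.Println(name)
	}
}
```

Adding another script is a matter of listing one more class inside the brackets, for example \p{Cyrillic}.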
