Question

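I need some advice from experienced gophers.

I'm parsing words out of sentences, and my \w+ regexp works fine with Latin characters. However, it fails completely on some Cyrillic characters.

Here is a sample app: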

package main

import (
    "fmt"
    "regexp"
)

func get_words_from(text string) []string {
    words := regexp.MustCompile("\\w+")
    return words.FindAllString(text, -1)
}

func main() {
    text := "One, two three!"
    text2 := "Раз, два три!"
    text3 := "Jedna, dva tři čtyři pět!"
    fmt.Println(get_words_from(text))
    fmt.Println(get_words_from(text2))
    fmt.Println(get_words_from(text3))
}

It produces the following output:

 [One two three]
 []
 [Jedna dva t i ty i p t]

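It returns an empty result for the Russian text and splits the Czech words into extra syllables. I have no idea how to solve this. Could someone give me some advice?

Or maybe there is a better way to split a sentence into words without punctuation?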

Answer 1

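The \w shorthand class matches only ASCII characters in Go regexp (letters, digits, and the underscore), so you need the Unicode character class \p{L} to match letters from other scripts.

\w    word characters (== [0-9A-Za-z_])

Use a character class to include digits and the underscore as well: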

    regexp.MustCompile("[\\p{L}\\d_]+")

Output of the demo:

[One two three]
[Раз два три]
[Jedna dva tři čtyři pět]

Golang regexp with non-Latin characters
