Question

I'm trying to convert the string PyrÉnÉes to Pyrenees in Go, but I get Pyr�n�es as output.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "PyrÉnÉes"
    t := make([]byte, utf8.RuneCountInString(s))
    i := 0
    for _, r := range s {
        t[i] = byte(r)
        i++
    }
    fmt.Print(string(t))
}


Can someone tell me what I'm doing wrong?

Answer 1

In Go, the type rune is an alias for int32 and represents a Unicode code point. byte is an alias for uint8.

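When you range over a string, you get runes, which are Unicode code points, not the bytes of the UTF-8 encoding. Converting a rune to a byte keeps only the low 8 bits of the code point, so you are effectively storing (truncated) code points in your byte slice rather than UTF-8 bytes.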

fmt.Printf("%x", 'É') // prints c9 (code point U+00C9), whereas the UTF-8 encoding of É is [0xC3 0x89]

So when you convert the byte slice back to a string, you get a byte sequence that is no longer valid UTF-8, and it looks corrupted when printed.

For a simple program like the one you posted, you can correctly turn É into e by checking each rune as you range over the string.

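// Replaces the loop in the question's main.
for _, r := range s {
    if r == 'É' {
        t[i] = byte('e') // safe: 'e' is ASCII
    } else {
        t[i] = byte(r) // still truncates any other non-ASCII rune
    }
    i++
}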

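Note, however, that byte('e') works only because e is an ASCII character, so its Unicode code point equals its single-byte UTF-8 representation. Any other non-ASCII rune in the input would still be truncated, so this is not a reliable way to do the conversion.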
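One way to avoid the truncation with only the standard library is to build the output rune by rune instead of byte by byte. The following is a sketch, not the approach from the linked question: the replacements table and the replaceRunes function are hypothetical names, and the table only covers the one character from this example.

```go
package main

import (
	"fmt"
	"strings"
)

// replacements is a hypothetical, hand-written table for illustration.
// It only covers the character from the question; extend it as needed.
var replacements = map[rune]rune{
	'É': 'e',
}

// replaceRunes swaps mapped runes and writes everything else unchanged.
// WriteRune emits the proper UTF-8 encoding, so unmapped non-ASCII
// runes survive instead of being truncated to a single byte.
func replaceRunes(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	for _, r := range s {
		if repl, ok := replacements[r]; ok {
			r = repl
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(replaceRunes("PyrÉnÉes")) // Pyrenees
}
```

A lookup table like this does not scale to arbitrary accented text, but unlike the byte slice version it never produces invalid UTF-8.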

For more details, see the duplicate target of this question:

Golang convert UTF-8 string to ASCII
