I want to port my Python program to Go, which involves converting a Unicode string to a UCS-2 hex string.
In Python, it's quite simple:
u"Bien joué".encode('utf-16-be').encode('hex')
-> 004200690065006e0020006a006f007500e9
I'm a Go beginner, and the simplest way I have found is:
package main

import (
	"fmt"
	"strings"
)

func main() {
	str := "Bien joué"
	fmt.Printf("str: %s\n", str)
	ucs2HexArray := []rune(str)
	s := fmt.Sprintf("%U", ucs2HexArray)
	a := strings.Replace(s, "U+", "", -1)
	b := strings.Replace(a, "[", "", -1)
	c := strings.Replace(b, "]", "", -1)
	d := strings.Replace(c, " ", "", -1)
	fmt.Printf("->: %s", d)
}
str: Bien joué
->: 004200690065006E0020006A006F007500E9
I think this is obviously not efficient. How can it be improved?
Thanks
Answer 0 (score: 3)
Factor the conversion into a function; then you can easily improve the conversion algorithm. For example,
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
)

func hexUTF16FromString(s string) string {
	hex := fmt.Sprintf("%04x", utf16.Encode([]rune(s)))
	return strings.Replace(hex[1:len(hex)-1], " ", "", -1)
}

func main() {
	str := "Bien joué"
	fmt.Println(str)
	hex := hexUTF16FromString(str)
	fmt.Println(hex)
}
Output:
Bien joué
004200690065006e0020006a006f007500e9
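Once the conversion lives in its own function, one possible improvement is to skip the formatted string and the `strings.Replace` pass altogether. The following is only a sketch under that assumption: the helper name `hexUTF16BE` is made up for illustration, and it writes each 16-bit code unit big-endian by hand before hex-encoding the bytes with `encoding/hex`.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"unicode/utf16"
)

// hexUTF16BE returns the UTF-16BE encoding of s as a lowercase hex string.
// The name is hypothetical; it is not part of the answer above.
func hexUTF16BE(s string) string {
	units := utf16.Encode([]rune(s)) // UTF-16 code units, surrogate pairs included
	buf := make([]byte, 2*len(units))
	for i, u := range units {
		buf[2*i] = byte(u >> 8) // high byte first: big-endian
		buf[2*i+1] = byte(u)    // then the low byte
	}
	return hex.EncodeToString(buf)
}

func main() {
	fmt.Println(hexUTF16BE("Bien joué"))
}
```

For the input above this should print the same hex string as the `fmt.Sprintf` version, with one allocation for the byte buffer and one for the result.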
Note:
You say "convert a unicode string to a UCS-2 string", but your Python example uses UTF-16:
u"Bien joué".encode('utf-16-be').encode('hex')
Q: What is the difference between UCS-2 and UTF-16?
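A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.
Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.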
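Answer 1 (score: 3)
For anything other than trivially short input (and possibly even then), I'd use the golang.org/x/text/encoding/unicode package to convert to UTF-16 (which, as @peterSo and @JimB point out, is slightly different from the obsolete UCS-2).
The advantage of using this (together with the golang.org/x/text/transform package) over unicode/utf16 is that you get BOM support and a choice of big or little endian, and that you can not only encode/decode short strings or byte slices but also apply it as a filter to an io.Reader or an io.Writer, so your data is transformed as you process it rather than all up front (e.g. for a large stream of data you don't need to hold it all in memory).
E.g.: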
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

const input = "Bien joué"

func main() {
	// Get a `transform.Transformer` for encoding.
	e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
	t := e.NewEncoder()

	// For decoding, allow a Byte Order Mark at the start to switch to
	// the corresponding Unicode decoding (UTF-8, UTF-16BE, or UTF-16LE);
	// otherwise we use `e` (UTF-16BE without BOM):
	t2 := unicode.BOMOverride(e.NewDecoder())
	_ = t2 // we don't show/use this

	// If you have a string:
	str := input
	outstr, n, err := transform.String(t, str)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("string: n=%d, bytes=%02x\n", n, []byte(outstr))

	// If you have a []byte:
	b := []byte(input)
	outbytes, n, err := transform.Bytes(t, b)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("bytes: n=%d, bytes=%02x\n", n, outbytes)

	// If you have an io.Reader for the input:
	ir := strings.NewReader(input)
	r := transform.NewReader(ir, t)
	// Now just read from r as you normally would and the encoding will
	// happen as you read, good for large sources to avoid pre-encoding
	// everything. Here we'll just read it all in one go though, which negates
	// that benefit (normally avoid io.ReadAll).
	outbytes, err = io.ReadAll(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reader: len=%d, bytes=%02x\n", len(outbytes), outbytes)

	// If you have an io.Writer for the output:
	var buf bytes.Buffer
	w := transform.NewWriter(&buf, t)
	_, err = fmt.Fprint(w, input) // or io.Copy from an io.Reader, or whatever
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("writer: len=%d, bytes=%02x\n", buf.Len(), buf.Bytes())
}

// Whichever of these you need you could of
// course put in a single simple function. E.g.:

// NewUTF16BEWriter returns a new writer that wraps w
// by transforming the bytes written into UTF-16-BE.
func NewUTF16BEWriter(w io.Writer) io.Writer {
	e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
	return transform.NewWriter(w, e.NewEncoder())
}

// ToUTF16BE converts UTF-8 `b` into UTF-16-BE.
func ToUTF16BE(b []byte) ([]byte, error) {
	e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
	out, _, err := transform.Bytes(e.NewEncoder(), b)
	return out, err
}
Gives:
string: n=10, bytes=004200690065006e0020006a006f007500e9
bytes: n=10, bytes=004200690065006e0020006a006f007500e9
reader: len=18, bytes=004200690065006e0020006a006f007500e9
writer: len=18, bytes=004200690065006e0020006a006f007500e9
Answer 2 (score: -1)
The standard library has a built-in utf16.Encode() function (https://golang.org/pkg/unicode/utf16/#Encode).