Question

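I'm converting a Go program that decodes emails. It currently runs iconv to do the actual decoding, which of course has overhead. I'd like to use the golang.org/x/text/transform and golang.org/x/net/html/charset packages to do this instead. Here is working code: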

// cs is the charset that the email body is encoded with, pulled from
// the Content-Type declaration.
enc, name := charset.Lookup(cs)
if enc == nil {
    log.Fatalf("Can't find %s", cs)
}
// body is the email body we're converting to utf-8
r := transform.NewReader(strings.NewReader(body), enc.NewDecoder())

// result contains the converted-to-utf8 email body
result, err := ioutil.ReadAll(r)

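This works great unless it encounters an illegal byte, which is sadly not uncommon when dealing with email. ioutil.ReadAll() returns an error along with all the bytes converted up to the point of the problem. Is there a way to tell the transform package to ignore illegal bytes? Right now we pass the -c flag to iconv to do that. I've gone through the documentation for the transform package and I can't tell whether it's possible.

Update: here is a test program that shows the problem (the Go playground doesn't have the charset or transform packages...). The raw text was taken from an actual email. Yes, it's in English, and yes, the charset in the email was set to EUC-KR. I need it to ignore that apostrophe.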

package main

import (
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/net/html/charset"
    "golang.org/x/text/transform"
)

func main() {
    raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
    enc, _ := charset.Lookup("euc-kr")
    r := transform.NewReader(strings.NewReader(raw), enc.NewDecoder())
    result, err := ioutil.ReadAll(r)
    if err != nil {
        log.Printf("ReadAll returned %s", err)
    }
    log.Printf("RESULT: '%s'", string(result))
}

Answer 1

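enc.NewDecoder() gives you a transform.Transformer. The documentation for NewDecoder() says:

Transforming source bytes that are not of that encoding will not result in an error per se. Each byte that cannot be transcoded will be represented in the output by the UTF-8 encoding of '\uFFFD', the replacement rune.

This tells us that the reader is failing on the replacement rune (also known as the error rune). Fortunately, it's easy to strip those out.

golang.org/x/text/transform provides two helper functions that we can use to solve this problem. Chain() takes a set of transformers and chains them together. RemoveFunc() takes a function and filters out all runes for which it returns true.

Something like the following (untested) should work: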

filter := transform.Chain(enc.NewDecoder(), transform.RemoveFunc(func(r rune) bool {
    return r == utf8.RuneError
}))
r := transform.NewReader(strings.NewReader(body), filter)

This should filter out all the error runes before they reach the reader.
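For comparison, here is a minimal standard-library sketch of the same idea applied after decoding rather than inside the transform chain. It assumes the decoder has already replaced each untranslatable byte with U+FFFD and that the text contains no legitimate U+FFFD worth keeping. The helper name stripReplacement is made up for this illustration and is not part of any package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stripReplacement drops every U+FFFD (utf8.RuneError) rune from s.
// Assumption: any U+FFFD in s was produced by the decoder for a byte it
// could not transcode, so nothing meaningful is lost by removing it.
func stripReplacement(s string) string {
	return strings.Map(func(r rune) rune {
		if r == utf8.RuneError {
			return -1 // a negative result tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// Hypothetical decoder output with one replacement rune in it.
	decoded := "you\uFFFDre getting 8 kilobytes a second."
	fmt.Println(stripReplacement(decoded))
}
```

This only helps if the decoder actually emits U+FFFD instead of returning an error. Also worth knowing: in current versions of golang.org/x/text, transform.RemoveFunc is marked deprecated in favor of runes.Remove from golang.org/x/text/runes, so the chain above may need that substitution.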

Answer 2

Here is the solution I went with. Instead of using a Reader, I allocate the destination buffer by hand and call the Transform() function directly. When Transform() returns an error, I check for a short destination buffer and reallocate if necessary. Otherwise I skip over one rune, assuming it is the illegal character. For completeness, I should also check for a short input buffer, but I don't do so in this example.

raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
enc, _ := charset.Lookup("euc-kr")
dst := make([]byte, len(raw))
d := enc.NewDecoder()

var (
    in  int
    out int
)
for in < len(raw) {
    // Do the transformation
    ndst, nsrc, err := d.Transform(dst[out:], []byte(raw[in:]), true)
    in += nsrc
    out += ndst
    if err == nil {
        // Completed transformation
        break
    }
    if err == transform.ErrShortDst {
        // Our output buffer is too small, so we need to grow it
        log.Printf("Short")
        t := make([]byte, (cap(dst)+1)*2)
        copy(t, dst)
        dst = t
        continue
    }
    // We're here because of at least one illegal character. Skip over the current rune
    // and try again.
    _, width := utf8.DecodeRuneInString(raw[in:])
    in += width
}
// dst[:out] now holds the decoded UTF-8 text.
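The skip step in the loop above advances by whatever width utf8.DecodeRuneInString reports, so it can skip more than one byte at a time. A small standard-library sketch of that behavior follows. The helper name skipWidth is hypothetical, introduced only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// skipWidth reports how many bytes the skip step would advance past
// at the start of s. It mirrors the DecodeRuneInString call in the loop.
func skipWidth(s string) int {
	_, width := utf8.DecodeRuneInString(s)
	return width
}

func main() {
	// The apostrophe in the test string is U+2019, which is three bytes
	// in a UTF-8 Go string, so all three are skipped together.
	fmt.Println(skipWidth("\u2019re getting"))

	// A lone byte that is not valid UTF-8 decodes as utf8.RuneError
	// with width 1, so only that single byte is skipped.
	fmt.Println(skipWidth("\x92re getting"))
}
```

This works for the test program because the raw text is a Go string literal and therefore UTF-8. If the input really were in a multi-byte legacy encoding, skipping by UTF-8 rune width would be a guess, and skipping a single byte might be the safer assumption.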

Ignoring illegal bytes when decoding text with Go?
