Decoding hex-encoded characters in a string using Scala

Date: 2017-07-19 18:45:48

Tags: json scala unicode character-encoding non-ascii-characters

I have a multi-line JSON file whose records contain special characters encoded as hex escapes. Here is an example of a single JSON record:

{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}

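This record should read {"value":"ıarines Bintıç Ramuçlar"}. That is, the " character has been replaced by its hex escape \x22, and other special Unicode characters have been replaced by one or two hex escapes, one per UTF-8 byte (for example, \xC3\xA7 encodes ç).

I need to convert strings like this into regular Unicode strings in Scala, so that printing one produces {"value":"ıarines Bintıç Ramuçlar"} with no hex escapes.

In Python I can decode these records easily with a single line of code: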

>>> a = "{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"
>>> a.decode("utf-8")
u'{"value":"\u0131arines Bint\u0131\xe7 Ramu\xe7lar"}'
>>> print a.decode("utf-8")
{"value":"ıarines Bintıç Ramuçlar"}

In Scala, however, I could not find a way to decode it. I tried converting it like this, without success:

scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(new String(a.getBytes(), "UTF-8"))
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
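The round trip through getBytes changes nothing because the triple-quoted literal holds the four characters \, x, 2, 2 rather than the single byte 0x22, so there are no encoded bytes to decode. A minimal sketch of that observation (RoundTripDemo, raw and roundTrip are illustrative names, not part of the question):

```scala
import java.nio.charset.StandardCharsets

object RoundTripDemo {
  // In a triple-quoted literal, \x22 is four separate characters, not one byte.
  val raw: String = """\x22"""

  // Encoding to UTF-8 and immediately decoding again returns the same text.
  def roundTrip(s: String): String =
    new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
}
```

Here RoundTripDemo.raw.length is 4, and RoundTripDemo.roundTrip(RoundTripDemo.raw) equals the input unchanged.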

I also tried URLDecoder, as suggested in a solution to a similar question (which was about URLs):

scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(java.net.URLDecoder.decode(a.replace("\\x", "%"), "UTF-8"))
{"value":"ıarines Bintıç Ramuçlar"}

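This produces the desired result for this example, but it does not seem safe for general text fields: URLDecoder is designed for URLs, and the approach requires replacing every \x in the string with %.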
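The concern can be made concrete with two inputs that are ordinary text but have a special meaning in URL encoding. This is only a sketch; UrlDecoderPitfalls and viaUrlDecoder are hypothetical names wrapping the workaround shown above:

```scala
import java.net.URLDecoder
import scala.util.Try

object UrlDecoderPitfalls {
  // The workaround from the question: turn \xNN into %NN and URL-decode.
  def viaUrlDecoder(s: String): String =
    URLDecoder.decode(s.replace("\\x", "%"), "UTF-8")

  // A literal '+' in the text is silently turned into a space.
  val plusLost: String = viaUrlDecoder("""1+1=2""")

  // A literal '%' that is not followed by two hex digits makes decode throw.
  val percentFails: Boolean = Try(viaUrlDecoder("""100% sure""")).isFailure
}
```

With these inputs, plusLost is "1 1=2" and percentFails is true, so any record containing a plain + or % would be corrupted or rejected.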

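Is there a better way to handle this in Scala?

I am new to Scala, and any help is appreciated.

Update: I wrote a custom solution using javax.xml.bind.DatatypeConverter.parseHexBinary. It works for now, but it looks cumbersome and not at all elegant. I think there should be a simpler way to do this.

Here is the code: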

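// Note: javax.xml.bind is no longer bundled with the JDK from Java 11 onward.
import javax.xml.bind.DatatypeConverter
import scala.annotation.tailrec
import scala.util.matching.Regex

def decodeHexChars(string: String): String = {
  // Matches one \xNN escape at the start of the input; (?s) lets the suffix span newlines.
  // Exactly two hex digits are required, because parseHexBinary rejects odd-length input.
  val regexHex: Regex = """(?s)\A\\[xX]([0-9a-fA-F]{2})(.*)""".r
  // Decodes the buffered hex digits as UTF-8 and prepends the characters, reversed, to acc.
  def purgeBuffer(buffer: String, acc: List[Char]): List[Char] = {
    if (buffer.isEmpty) acc
    else new String(DatatypeConverter.parseHexBinary(buffer), "UTF-8").reverse.toList ::: acc
  }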
  @tailrec
  def traverse(s: String, acc: List[Char], buffer: String): String = s match {
    case "" =>
      val accUpdated = purgeBuffer(buffer, acc)
      accUpdated.foldRight("")((str, b) => b + str)
    case regexHex(chars, suffix) =>
      traverse(suffix, acc, buffer + chars)
    case _ =>
      val accUpdated = purgeBuffer(buffer, acc)
      traverse(s.tail, s.head :: accUpdated, "")
  }
  traverse(string, Nil, "")
}

2 answers:

Answer 0 (score: 0)

The problem is that this encoding is really specific to Python (I think). Something like this might work:

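import scala.util.matching.Regex

val s = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""

// Caveat: each \xNN escape is decoded on its own, so only single-byte (ASCII)
// escapes such as \x22 come out right; multi-byte UTF-8 sequences do not.
"""\\x([A-F0-9]{2})""".r.replaceAllIn(s, (x: Regex.Match) =>
  new String(BigInt(x.group(1), 16).toByteArray, "UTF-8")
)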

Answer 1 (score: 0)

Each \x?? encodes one byte, like \x22 encodes " and \x5C encodes \. But in UTF-8 some characters are encoded using multiple bytes, so you need to transform \xC4\xB1 into the ı symbol and so on.

replaceAllIn is really nice, but it might eat your slashes. So, if you don't use group references (like $1) in the replacement string, quoteReplacement is the recommended way to escape \ and $ symbols.
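One way to put these two points together is sketched below. It is not the answerer's own code: HexEscapes, escapeRun and decode are illustrative names, and the sketch assumes every escape has exactly two hex digits and that each run of escapes forms valid UTF-8.

```scala
import java.nio.charset.StandardCharsets
import scala.util.matching.Regex

object HexEscapes {
  // One or more consecutive \xNN escapes, so a multi-byte UTF-8 sequence is decoded as a unit.
  private val escapeRun: Regex = """(?:\\x[0-9A-Fa-f]{2})+""".r

  def decode(s: String): String =
    escapeRun.replaceAllIn(s, m => {
      // Each escape is 4 characters (\xNN); the last 2 are the hex digits of one byte.
      val bytes = m.matched.grouped(4).map(e => Integer.parseInt(e.drop(2), 16).toByte).toArray
      // quoteReplacement keeps a decoded \ or $ from being read as replacement syntax.
      Regex.quoteReplacement(new String(bytes, StandardCharsets.UTF_8))
    })
}
```

Calling HexEscapes.decode on the record from the question should yield {"value":"ıarines Bintıç Ramuçlar"}, and text without escapes, including a literal + or %, passes through unchanged.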

P.S. Does anyone know the difference between \ and $?