Question

I'm having a problem with some code that extracts an HTML string from a Spray HttpResponse and needs to take the content type into account:

def getHtmlString(response: HttpResponse): String = {
  response.header[HttpHeaders.`Content-Type`] match {
    case Some(t) => HttpRequester.unmarshaller(response)
    case None => // more logic to detect the content-type using Tika.
  }
}

where the HttpRequester object declares unmarshaller as:

object HttpRequester extends RequestBuilding with ResponseTransformation {
  import spray.httpx.encoding.{ Gzip, Deflate }
  val unmarshaller = decode(Gzip) ~> decode(Deflate) ~> unmarshal[String]
  ...
}

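It works for most of the HTML I fetch, where the Content-Type header declares the encoding. Sometimes, though (for example, this Vice page), I get a page served as Content-Type: text/html whose encoding is defined in a <meta charset="utf-8"> tag inside the <head> of the response. I know that's perfectly valid behaviour, but the resulting String I get from the unmarshaller is not decoded as UTF-8.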

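It's no surprise that Spray doesn't do this out of the box (it involves reading through the body, which is probably more involvement than Spray wants), and I assume I'll have to write a custom Unmarshaller[String] that reads the body, starts parsing the HTML, checks whether the <head> contains a <meta charset="whatever"> tag, and then re-decodes the string accordingly. I just wanted to check whether Spray has the capability to do this for us.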
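To make the idea concrete, here is a minimal sketch of the sniffing step only, written against the JDK so it does not depend on any particular Spray API. The names MetaCharsetSniffer, sniffCharset, decodeHtml and PrescanLimit are hypothetical, the 1024-byte scan window and the ISO-8859-1 fallback are assumptions, and the regex is a rough stand-in for real HTML parsing:

```scala
import java.nio.charset.{Charset, StandardCharsets}
import scala.util.Try

// Hypothetical helper; not part of Spray.
object MetaCharsetSniffer {
  // Matches <meta charset="utf-8"> as well as the older
  // <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
  private val MetaCharset =
    """(?i)<meta[^>]*?charset\s*=\s*["']?\s*([a-z0-9_.:-]+)""".r

  // Assumption: only the first 1024 bytes are scanned for the declaration.
  private val PrescanLimit = 1024

  // Looks for a charset declaration near the start of the raw body.
  def sniffCharset(body: Array[Byte]): Option[Charset] = {
    // ISO-8859-1 maps every byte to a char, so this view never fails to decode.
    val head = new String(body, 0, math.min(body.length, PrescanLimit),
                          StandardCharsets.ISO_8859_1)
    MetaCharset.findFirstMatchIn(head)
      .map(_.group(1))
      // Unknown or malformed charset names are treated as "not declared".
      .flatMap(name => Try(Charset.forName(name)).toOption)
  }

  // Decodes the body with the sniffed charset, or with the fallback
  // (assumed here to be ISO-8859-1) when no declaration is found.
  def decodeHtml(body: Array[Byte],
                 fallback: Charset = StandardCharsets.ISO_8859_1): String =
    new String(body, sniffCharset(body).getOrElse(fallback))
}
```

This would have to run on the raw bytes after the Gzip/Deflate decoding step and in place of unmarshal[String]; how best to wire it into a custom Unmarshaller[String] is exactly the part I'm unsure about.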

Spray does not detect <meta /> content encoding

0 answers: