How do I correctly read a sequence of bytes from a hyper::client::Request and print it to the console as a UTF-8 string?

Time: 2016-07-22 18:22:09

Tags: utf-8 rust bytebuffer

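I'm exploring Rust and trying to make a simple HTTP request (using the hyper crate) and print the response body to the console. The response implements std::io::Read. After reading various documentation sources and basic tutorials, I have arrived at the following code, which I compile and run with RUST_BACKTRACE=1 cargo run: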

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(mut res) => {
            println!("{}", res.status);

            let mut body = String::new();

            match res.read_to_string(&mut body) {
                Ok(body) => println!("{:?}", body),
                Err(why) => panic!("String conversion failure: {:?}", why)
            }
        },
        Err(why) => panic!("{:?}", why)
    }
}

Expected:

The nice, human-readable HTML content of the body, as delivered by the HTTP server, is printed to the console.

Actual:

200 OK
thread '<main>' panicked at 'String conversion failure: Error { repr: Custom(Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }) }', src/printer.rs:16
stack backtrace:
   1:        0x109e1faeb - std::sys::backtrace::tracing::imp::write::h3800f45f421043b8
   2:        0x109e21565 - std::panicking::default_hook::_$u7b$$u7b$closure$u7d$$u7d$::h0ef6c8db532f55dc
   3:        0x109e2119e - std::panicking::default_hook::hf3839060ccbb8764
   4:        0x109e177f7 - std::panicking::rust_panic_with_hook::h5dd7da6bb3d06020
   5:        0x109e21b26 - std::panicking::begin_panic::h9bf160aee246b9f6
   6:        0x109e18248 - std::panicking::begin_panic_fmt::haf08a9a70a097ee1
   7:        0x109d54378 - libplayground::printer::print_html::hff00c339aa28fde4
   8:        0x109d53d76 - playground::main::h0b7387c23270ba52
   9:        0x109e20d8d - std::panicking::try::call::hbbf4746cba890ca7
  10:        0x109e23fcb - __rust_try
  11:        0x109e23f65 - __rust_maybe_catch_panic
  12:        0x109e20bb1 - std::rt::lang_start::hbcefdc316c2fbd45
  13:        0x109d53da9 - main
error: Process didn't exit successfully: `target/debug/playground` (exit code: 101)

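Thoughts

Since I received 200 OK from the server, I believe I got a valid response (I can also confirm this empirically by making the same request in a more familiar programming language). Therefore, the error must be caused by my incorrect conversion of the byte sequence to a UTF-8 string.

Alternatives

I also tried the following solution, which lets me print the bytes to the console as a series of hex strings, but I know this is fundamentally wrong, because a UTF-8 character can take 1-4 bytes. Therefore, trying to convert individual bytes into UTF-8 characters as in this example only works for a very limited (255, to be exact) subset of UTF-8 characters.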

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(res) => {
            println!("{}", res.status);

            for byte in res.bytes() {
                print!("{:x}", byte.unwrap());
            }
        },
        Err(why) => panic!("{:?}", why)
    }
}

2 answers:

Answer 0: (score: 4)

We can use the iconv command to confirm that the data returned from http://www.google.com is not valid UTF-8:

$ wget http://google.com -O page.html
$ iconv -f utf-8 page.html > /dev/null
iconv: illegal input sequence at position 5591

For some other URLs (for example http://www.reddit.com), the code works fine.

If we assume that most of the data is valid UTF-8, we can use String::from_utf8_lossy to work around the problem:

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(mut res) => {
            println!("{}", res.status);

            let mut body = Vec::new();

            match res.read_to_end(&mut body) {
                Ok(_) => println!("{:?}", String::from_utf8_lossy(&*body)),
                Err(why) => panic!("String conversion failure: {:?}", why),
            }
        }
        Err(why) => panic!("{:?}", why),
    }
}
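
To see what the lossy conversion does in isolation, here is a minimal sketch that uses only the standard library and no network access. The helper name decode_lossy is hypothetical and only wraps String::from_utf8_lossy for illustration:

```rust
use std::borrow::Cow;

/// Decodes `bytes` as UTF-8, replacing every invalid sequence with U+FFFD.
/// `decode_lossy` is a hypothetical helper name, not part of hyper or std.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    // Valid input is borrowed as-is; invalid input is copied into a new
    // String with replacement characters inserted.
    String::from_utf8_lossy(bytes)
}
```

For example, decode_lossy(b"caf\xe9") yields "caf" followed by the replacement character U+FFFD, because a lone 0xE9 byte is not valid UTF-8. The original character is lost, so this only suits cases where a few garbled characters are acceptable.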

Note that Read::read_to_string and Read::read_to_end return Ok with the number of bytes read on success, not the data that was read.
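
A small sketch of that distinction, using an in-memory reader instead of an HTTP response. The function name read_body is an assumption made for illustration:

```rust
use std::io::{self, Read};

/// Reads everything from `reader` and returns the text together with the
/// byte count reported by `read_to_string`.
/// `read_body` is a hypothetical helper name.
fn read_body<R: Read>(mut reader: R) -> io::Result<(String, usize)> {
    let mut body = String::new();
    // The `Ok` value is the number of bytes read; the text lands in `body`.
    let bytes_read = reader.read_to_string(&mut body)?;
    Ok((body, bytes_read))
}
```

Calling read_body(&b"<html></html>"[..]) works because &[u8] implements Read. Binding the Ok value to a name such as body, as the question's code does, shadows the buffer and prints the count rather than the text.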

Answer 1: (score: 4)

If you actually look at the headers Google returns:

HTTP/1.1 200 OK
Date: Fri, 22 Jul 2016 20:45:54 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=82=YwAD4Rj09u6gUA8OtQH73BUz6UlNdeRc9Z_iGjyaDqFdRGMdslypu1zsSDWQ4xRJFyEn9-UtR7U6G7HKehoyxvy9HItnDlg8iLsxzlhNcg01luW3_-HWs3l9S3dmHIVh; expires=Sat, 21-Jan-2017 20:45:54 GMT; path=/; domain=.google.ca; HttpOnly
Alternate-Protocol: 443:quic
Alt-Svc: quic=":443"; ma=2592000; v="36,35,34,33,32,31,30,29,28,27,26,25"
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

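You can see:

  Content-Type: text/html; charset=ISO-8859-1

Additionally:

  "Therefore, the error must be caused by my incorrect conversion of the byte sequence to a UTF-8 string."

There is no conversion to UTF-8 happening. read_to_string merely ensures that the data is UTF-8.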
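
The validation-only behaviour can be reproduced without a network request. The sketch below feeds bytes to read_to_string through an in-memory reader and reports the error kind, if any. The helper name read_to_string_error_kind is hypothetical:

```rust
use std::io::{self, Read};

/// Returns the error kind that `read_to_string` reports for `bytes`,
/// or `None` when the bytes are valid UTF-8.
/// `read_to_string_error_kind` is a hypothetical helper name.
fn read_to_string_error_kind(bytes: &[u8]) -> Option<io::ErrorKind> {
    let mut reader = bytes; // `&[u8]` implements `Read`
    let mut text = String::new();
    // No transcoding happens here: the bytes are only checked for validity.
    reader.read_to_string(&mut text).err().map(|e| e.kind())
}
```

With the ISO-8859-1 bytes b"caf\xe9" this returns Some(InvalidData), the same error kind shown in the question's panic message, while the UTF-8 bytes of the same word are accepted.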

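Simply put, it is completely wrong to assume that an arbitrary HTML page is encoded in UTF-8. At best, you have to parse the headers to find the encoding and then convert the data. This is complicated because there's no real definition for what encoding the headers are in.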
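
As a rough illustration of the header-parsing step, the sketch below pulls the charset parameter out of a Content-Type value. It assumes the header value is already available as a &str, and the function name charset_from_content_type is hypothetical. It is a simplification: it does not handle every quoting rule, and it ignores encodings declared inside the HTML itself.

```rust
/// Extracts the `charset` parameter from a Content-Type header value,
/// lowercased, or `None` if no charset is declared.
/// `charset_from_content_type` is a hypothetical helper name.
fn charset_from_content_type(value: &str) -> Option<String> {
    value
        .split(';')
        .skip(1) // the first segment is the media type, e.g. `text/html`
        .filter_map(|param| {
            let mut parts = param.splitn(2, '=');
            let name = parts.next()?.trim();
            let val = parts.next()?.trim().trim_matches('"');
            if name.eq_ignore_ascii_case("charset") && !val.is_empty() {
                Some(val.to_ascii_lowercase())
            } else {
                None
            }
        })
        .next()
}
```

For the header shown above, charset_from_content_type("text/html; charset=ISO-8859-1") returns Some("iso-8859-1"). A value without a charset parameter returns None, in which case the caller has to fall back to some other guess.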

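Once you have found the correct encoding, you can use a crate such as encoding to properly convert the result to UTF-8, if the result is even text! Remember that HTTP can return binary files such as images.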
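
For the specific case of ISO-8859-1, the conversion can be sketched with the standard library alone, because each byte value in that encoding equals the Unicode code point of its character. This is a stand-in for a proper decoding crate, not a replacement for one: it handles only ISO-8859-1, and the name latin1_to_string is hypothetical.

```rust
/// Decodes ISO-8859-1 (Latin-1) bytes into a UTF-8 `String`.
/// `latin1_to_string` is a hypothetical helper name; other encodings
/// need a real decoder such as the one a dedicated crate provides.
fn latin1_to_string(bytes: &[u8]) -> String {
    // Each Latin-1 byte maps directly to the code point U+0000..=U+00FF.
    bytes.iter().map(|&b| b as char).collect()
}
```

Here latin1_to_string(b"caf\xe9") produces "café", with the single 0xE9 byte re-encoded as the two-byte UTF-8 sequence for é.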