Question

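I am downloading the source code of an HTML page and writing it back to a txt file. The output on the terminal looks correct, but after writing it to a file and opening the file in gedit, the content looks like this: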

<^@!^@D^@O^@C^@T^@Y^@P^@E^@ ^@h^@t^@m^@l^@ ^@P^@U^@B^@L^@I^@C^@ ^@"^@-^@/^@/^@W^@3^@C^@/^@/^@D^@T^@D^@ ^@X^@H^@T^@M^@L^@ ^@1^@.^@0^@ ^@T^@r^@a^@n^@s^@i^@t^@i^@o^@n^@a^@l^

I am reading the page line by line using a BufferedReader:

URL oracle = new URL("http://example.com");
BufferedReader in = new BufferedReader(
                    new InputStreamReader(oracle.openStream()));

while ((inputLine = in.readLine()) != null)
    {
        // appending to get the complete html string 
    }

Then I write the content using a PrintWriter.

PrintWriter pout = new PrintWriter("output.txt");
pout.write(html); // here html is the appended html string
pout.close();

Can someone help me resolve this?

Answer 1

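When reading from the URL you need to set the encoding to UTF-8, and when writing back you should again state that your encoding is UTF-8. Otherwise the default encoding is used, which is probably your system's encoding and may not handle Unicode characters well. Both InputStreamReader and OutputStreamWriter accept an encoding as an argument (the raw InputStream and OutputStream only deal in bytes). So you might want to replace the PrintWriter with an OutputStreamWriter, or pass the charset name to the PrintWriter constructor.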
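The advice above could be sketched roughly as follows. This is only a minimal sketch, assuming the page really is UTF-8. The class name Utf8RoundTrip, the helpers readAll and writeAll, and the in-memory sample bytes (standing in for oracle.openStream()) are made up for illustration.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

class Utf8RoundTrip {
    // Decode the stream as UTF-8 instead of relying on the platform default.
    static String readAll(InputStream in) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    // Encode the text as UTF-8 when writing, again bypassing the default.
    static void writeAll(String html, Path target) throws IOException {
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            out.write(html);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: sample bytes stand in for oracle.openStream().
        byte[] sample = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(sample));
        writeAll(html, Paths.get("output.txt"));
    }
}
```

If the page turns out to be in some other encoding, the same structure applies with a different Charset passed to the reader.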

Answer 2

I suggest using IOUtils from Apache Commons IO:

org.apache.commons.io.IOUtils.copy(connection.getInputStream(), new FileOutputStream(file));

URL url = new URL("http://example.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
String contentType = connection.getContentType();
System.out.println("content-type: " + contentType);
IOUtils.copy(connection.getInputStream(), new FileOutputStream("/folder/fileName.html"));
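If adding a dependency is not an option, the same byte-for-byte copy can probably be done with the JDK alone. The following is a sketch, assuming Java 7 or later. RawCopy and save are hypothetical names, and the sample bytes stand in for connection.getInputStream(). Since no Reader or Writer is involved, no charset is ever applied, and the file should contain exactly the bytes that were received.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

class RawCopy {
    // Copies the stream to the target file without decoding it.
    // Returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: sample bytes stand in for connection.getInputStream().
        byte[] sample = "<!DOCTYPE html>".getBytes(StandardCharsets.UTF_16LE);
        Path target = Files.createTempFile("page", ".html");
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(save(in, target) + " bytes written to " + target);
        }
    }
}
```

The trade-off is that the file keeps whatever encoding the server used, so the editor opening it still has to detect that encoding correctly.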

Answer 3

The ^@ is the byte 0, so you are reading with UTF-16, which seems to be your system default encoding.
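To see why a 0 byte after every character points to UTF-16, it may help to compare the bytes that the two encodings produce for the first characters of the page. This is a small illustration only; Utf16NulDemo and hex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class Utf16NulDemo {
    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "<!D";
        // UTF-8 uses one byte per ASCII character.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));    // 3c 21 44
        // UTF-16 (little endian) uses two bytes; the second is 0 for ASCII.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // 3c 00 21 00 44 00
    }
}
```

An editor that shows the file as single-byte text renders each of those 00 bytes as ^@, which matches the pattern in the question.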

Specify the encoding. The encoding given in the response header is decisive. If none is specified, fall back to the default, Latin-1.

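URL oracle = new URL("http://example.com");
URLConnection con = oracle.openConnection();
con.connect();
// The charset is a parameter of the Content-Type header, e.g. "text/html; charset=UTF-8".
// getContentEncoding() returns Content-Encoding (e.g. "gzip"), not the charset.
String contentType = con.getContentType();
String encoding = null;
int pos = contentType == null ? -1 : contentType.toLowerCase().indexOf("charset=");
if (pos >= 0) {
    encoding = contentType.substring(pos + 8).split(";")[0].replace("\"", "").trim();
}
if (encoding == null || encoding.equalsIgnoreCase("ISO-8859-1")) {
    encoding = "Windows-1252"; // Default is Latin-1, as Windows Latin-1
}
BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), encoding));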

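But you might also take a meta declaration into account, that is, a charset given in a <meta> tag inside the HTML itself.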
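One way to look for such a meta declaration is to scan the first bytes of the page with a regular expression. The following is a rough sketch, not a full HTML parser; MetaCharsetSniffer, sniff and the pattern are assumptions made for illustration. The bytes are decoded as ISO-8859-1 only because that maps every byte to a character, so the ASCII markup can be searched before the real encoding is known.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found.
    static String sniff(byte[] head) {
        String ascii = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(ascii);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] head = "<html><head><meta charset=\"utf-8\"></head>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(head)); // utf-8
    }
}
```

This only works when the markup itself is ASCII compatible. For a page that is actually sent as UTF-16, the interleaved 0 bytes keep the pattern from matching and the method returns null.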

Wrong encoding when writing HTML to a txt file
