Getting content from a URL returns weird characters

Asked: 2015-07-12 09:43:16

Tags: java jsp utf-8 character-encoding servlet-3.0

I am using this method to get the content from a URL:

public String getContentFromURL(String stringUrl) throws UnsupportedEncodingException{
    String content = "";
    try {
        URL url = new URL(stringUrl);
        URLConnection urlc = url.openConnection();
        StringBuilder builder;
        try (BufferedReader buffer = new BufferedReader(new InputStreamReader(urlc.getInputStream(), "UTF-8"))) {
            builder = new StringBuilder();
            int byteRead;
            while ((byteRead = buffer.read()) != -1)
                builder.append((char) byteRead);
        }
        content=builder.toString();
        return content;
    } catch (MalformedURLException ex) {
        Logger.getLogger(Utils.class.getName()).log(Level.SEVERE, null, ex);
    } catch (IOException ex) {
        Logger.getLogger(Utils.class.getName()).log(Level.SEVERE, null, ex);
    }
    return content;
}

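It works fine for most of the files I fetch, except those containing characters from other languages, such as á, í, etc. Instead of those characters I get garbled ones.

  1. I tried configuring the Tomcat connector: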

           <Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
           connectionTimeout="20000"
           redirectPort="8443" />
    
  2. The page encoding is declared as: <%@page contentType="text/html" pageEncoding="UTF-8"%>

  3. Added this in the servlet:

    response.setContentType("text/html;charset=UTF-8");
    response.setCharacterEncoding("UTF-8");
    request.setCharacterEncoding("UTF-8");
    
  4. Tried decoding the content as GZIP.

  5. None of the options above worked for me.

    This is the URL I am trying to get the content from:

    https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1
    

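    It is a file in Dropbox that even the browser reads correctly; raw=1 is used to fetch the file's contents directly. In the browser, search for "Môre om" to check that it displays correctly.

    What is the correct way to get content from a URL when it contains such characters?

    PS: Using Notepad++ I confirmed that the file on Dropbox is encoded as UTF-8.

    PS2: Getting the character encoding from the connection returns null.

    Update: I tried this code using the Google Guava library: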

            String content = "";
            URLConnection url = new URL("https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1").openConnection();
    
            InputStream stream = url.getInputStream();
            content = CharStreams.toString(new InputStreamReader(stream, Charsets.UTF_8));
            Closeables.closeQuietly(stream);
    
            try (PrintStream outText = new PrintStream(new FileOutputStream("C:\\Users\\myUser\\Desktop\\test.txt"))) {
                outText.print(content);
                outText.close();
            }
    

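    It works in a plain Java project and all characters display correctly, but not in a Java web app project. This is the index page where I tried this approach: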

    <%@page import="java.io.PrintStream"%>
    <%@page import="java.io.FileOutputStream"%>
    <%@page import="com.google.common.io.Closeables"%>
    <%@page import="java.io.InputStreamReader"%>
    <%@page import="com.google.common.io.CharStreams"%>
    <%@page import="com.google.common.base.Charsets"%>
    <%@page import="java.io.InputStream"%>
    <%@page import="java.net.URLConnection"%>
    <%@page import="java.net.URL"%>
    <%@page contentType="text/html" pageEncoding="UTF-8"%>
    <!DOCTYPE html>
    <html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Page</title>
    </head>
    <body>
        <%
            response.setContentType("text/html;charset=UTF-8");
            response.setCharacterEncoding("UTF-8");
            request.setCharacterEncoding("UTF-8");
    
            String content = "";
            URLConnection url = new URL("https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1").openConnection();
    
            InputStream stream = url.getInputStream();
            content = CharStreams.toString(new InputStreamReader(stream, Charsets.UTF_8));
            Closeables.closeQuietly(stream);
    
            try (PrintStream outText = new PrintStream(new FileOutputStream("C:\\Users\\myUser\\Desktop\\test.txt"))) {
                outText.print(content);
                outText.close();
            }
        %>
    </body>
    </html>
    

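    When I look at the created file, the garbled characters still appear. Why does the same code behave differently from the standalone application?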

    Solved: I replaced

    try (PrintStream outText = new PrintStream(new FileOutputStream("C:\\Users\\myUser\\Desktop\\test.txt"))) {
                outText.print(content);
                outText.close();
            }

    with

    Writer outText = new BufferedWriter(new OutputStreamWriter( new FileOutputStream("C:\\Users\\myUser\\Desktop\\testRaw.txt"), "UTF-8"));
            try {
                outText.write(content);
            } finally {
                outText.close();
            }
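
The replacement works because `new PrintStream(OutputStream)` encodes with the platform default charset, whereas the `OutputStreamWriter` names UTF-8 explicitly. As a rough sketch of the same idea with `java.nio.file.Files` (Java 7+): the class name `Utf8FileWrite`, the helper `writeUtf8`, and the temporary file are assumptions made for illustration and do not appear in the original code.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileWrite {

    // Writes text to the given file as UTF-8, independent of the platform default charset.
    static void writeUtf8(Path target, String content) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("testRaw", ".txt"); // hypothetical location
        writeUtf8(target, "M\u00f4re om");
        // Read the bytes back and decode them with the same charset to check the round trip.
        String back = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println(back.equals("M\u00f4re om"));
        Files.delete(target);
    }
}
```

Whichever API is used, the point is that the charset is stated at the place where characters become bytes.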
    

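2 answers:

Answer 0: (score: 2)

I turned your code into a minimal example and took out the odd bits (the point of a BufferedReader is to avoid reading char by char). I get perfectly good UTF-8. Try running it, redirect the output to a file, and check the result in a Unicode-aware text editor.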

import java.net.*;
import java.io.*;

public class UTF8Test {

    public static void main(String[] args) throws Exception {
        //System.out.println(getContentFromURL("http://www.columbia.edu/~kermit/utf8.html"));
        System.out.println(getContentFromURL("https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1"));
    }

    public static String getContentFromURL(String stringUrl) throws Exception {
        URL url = new URL(stringUrl);
        URLConnection urlc = url.openConnection();
        StringBuilder builder = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) when done
        try (BufferedReader breader = new BufferedReader(new InputStreamReader(urlc.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = breader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }
}

Answer 1: (score: 2)

You are writing the text with the default encoding; it is better to store it as UTF-8.

try (PrintStream outText = new PrintStream(
        new File("C:\\Users\\myUser\\Desktop\\test.txt"), "UTF-8")) {
    if (!content.startsWith("\uFEFF")) {
        outText.print("\uFEFF");
    }
    outText.print(content);
} // Calls outText.close()

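This writes the text with a BOM char '\uFEFF' at the start. It is an invisible zero-width space that Windows can use to detect UTF-8. Strictly speaking this is bad practice, but it allows the text to be edited in Notepad.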
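
To make the BOM concrete, here is a small sketch showing the bytes that U+FEFF becomes in UTF-8, plus one way a reader could drop it again. The class `BomDemo` and the helpers `bomHex` and `stripBom` are names invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {

    // Returns the UTF-8 encoding of U+FEFF as space-separated hex bytes.
    static String bomHex() {
        StringBuilder hex = new StringBuilder();
        for (byte b : "\uFEFF".getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Drops a leading BOM, if present, from text that has already been decoded.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        System.out.println(bomHex());              // prints "EF BB BF"
        System.out.println(stripBom("\uFEFFabc")); // prints "abc"
    }
}
```

Java's UTF-8 decoder does not remove the BOM by itself, so code that reads such a file back may want a check like `stripBom`.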

The error is that some Unicode characters cannot be mapped to the default encoding.
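
A minimal sketch of that failure mode: encoding text with a charset that lacks a character silently substitutes a replacement byte. US-ASCII is used here only as a stand-in for a restrictive default; the actual default on the asker's server is unknown, and the class `UnmappableDemo` and method `roundTrip` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmappableDemo {

    // Encodes text with the given charset, then decodes the bytes with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "M\u00f4re om";
        // US-ASCII cannot represent U+00F4, so getBytes substitutes '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII).equals("M?re om"));
        // UTF-8 can represent every Unicode character, so the text survives intact.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
        // The charset that new PrintStream(OutputStream) would pick up on this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

Printing `Charset.defaultCharset()` in both the standalone project and the web app is a quick way to see whether the two environments really differ.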

As an aside: you assume the text at the URL is UTF-8. In general, it is better to check the charset declared in the URLConnection headers.

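// The charset is a parameter of Content-Type; getContentEncoding() reports compression such as gzip.
String encoding = "UTF-8"; // fallback when no charset is declared
String contentType = urlc.getContentType(); // e.g. "text/html; charset=ISO-8859-1"
for (String param : (contentType == null ? "" : contentType).split(";")) {
    if (param.trim().toLowerCase().startsWith("charset=")) {
        encoding = param.trim().substring(8).replace("\"", "");
    }
}
if (encoding.equalsIgnoreCase("ISO-8859-1")) { // Latin-1
    encoding = "Windows-1252"; // Windows Latin-1
}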

The Latin-1 patch can be useful because all browsers on every operating system interpret ISO-8859-1 as Windows-1252; this is now official in HTML5.
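
The two charsets differ only in the range 0x80 to 0x9F, which the following sketch illustrates by decoding single bytes. It assumes the JVM provides the `windows-1252` charset, which is common but not among the charsets every Java platform is required to support; the class `Latin1VsCp1252` and method `decodeByte` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1VsCp1252 {

    // Decodes a single byte with the given charset and returns the resulting code point.
    static int decodeByte(int value, Charset charset) {
        return new String(new byte[] {(byte) value}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available
        // 0x80 is a control character in ISO-8859-1 but the euro sign in Windows-1252.
        System.out.printf("ISO-8859-1:   U+%04X%n", decodeByte(0x80, StandardCharsets.ISO_8859_1));
        System.out.printf("windows-1252: U+%04X%n", decodeByte(0x80, cp1252));
        // Accented letters such as 0xE1 decode identically in both charsets.
        System.out.println(decodeByte(0xE1, StandardCharsets.ISO_8859_1) == decodeByte(0xE1, cp1252));
    }
}
```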