Question

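Below is the function I use to scrape a website and save its text/content to a text file. Because the site is in Chinese, the data I get back from it is not supported: all I see in the output is question marks. I did some research and found that String most likely uses UTF-16 encoding, which in theory should support Chinese characters. In this case, though, it doesn't. I even tried printing some Chinese sentences with a print statement in Java, and that worked perfectly. I just don't understand why the String object does not support Chinese characters here. Can anyone help?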

    void contentGetter() throws IOException {
//      the string is kind of messed up so all i see in the file is question marks
//        Writer writer = new OutputStreamWriter(
//                new FileOutputStream("a.txt"), "UTF-8");
        ArrayList content = new ArrayList();
//        for (int i = 0; i<urlList.size(); i++){
            URL url;
            InputStream is = null;
            BufferedReader br;
            String line;
            try {
                // the url used here is http://ds.eywedu.com/jinyong/tlbb/mydoc001.htm feel free to try it
                url = new URL((String)urlList.get(0));
                is = url.openStream();
                br = new BufferedReader(new InputStreamReader(is));

                while ((line = br.readLine()) != null) {
                    content.add(line);
                }

                boolean title = false;
                for (int m = 0; m<content.size(); m++){
                    String contents = (String) content.get(m);
                    if (contents.contains("script type=\"text") && !title){
                        title = true;
//                        writer.write(contents.substring(contents.indexOf("\"4\"")+4,contents.indexOf("</font>")));
                        titles.add(
                                contents.substring(contents.indexOf("\"4\"")+4,contents.indexOf("</font>")));
                        for (int j = 0; j<=+1; j++){
                            content.remove(0);
                        }
                    }else if (contents.contains("script type=\"text")){
                        int loc = contents.indexOf("</DIV>");
                        int fLoc = contents.indexOf("<BR>");
                        if (loc != -1 && fLoc != -1){
                            filtered.add(contents.substring(loc+6, fLoc));
//                            writer.write(contents.substring(loc+6, fLoc));
                        }
                    } else if (contents.contains("<BR>")){
//                        writer.write(contents.substring(0,contents.indexOf("<BR>")));
                        filtered.add(contents.substring(0,contents.indexOf("<BR>")));

                    }

                }

            } catch (MalformedURLException mue) {
                mue.printStackTrace();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            } finally {
                try {
                    if (is != null) is.close();
                } catch (IOException ioe) {
                    //exception
                }
            }
        System.out.println(filtered);
//        writer.close();
        }
//    }
}

Answer 1

br = new BufferedReader(new InputStreamReader(is));

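You are opening the InputStreamReader with the system's default charset/encoding. That is most likely not the charset/encoding of the HTML page you are receiving. You should (always, not just here) use the InputStreamReader constructor that lets you specify the charset explicitly: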

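url = new URL((String) urlList.get(0));
URLConnection uc = url.openConnection();
is = uc.getInputStream();
// Note: uc.getContentEncoding() returns the Content-Encoding header (e.g. "gzip"),
// not the charset. Pass the page's charset instead; "GBK" is an assumption here,
// so check the Content-Type header or the page's <meta> tag for the real value.
br = new BufferedReader(new InputStreamReader(is, "GBK"));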
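
When the server declares a charset, it normally arrives in the Content-Type header (for example `text/html; charset=gb2312`). The following is a minimal sketch of pulling it out, not a complete solution: the class name `CharsetSniffer`, the helper `charsetFromContentType`, and the fallback parameter are all hypothetical names introduced for illustration, and a page that declares its charset only in a `<meta>` tag would still need separate handling.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetSniffer {

    /**
     * Extracts the charset parameter from a Content-Type header value such as
     * "text/html; charset=gb2312". Returns the fallback when the header is
     * missing, has no charset parameter, or names a charset this JVM lacks.
     */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs.name());
    }
}
```

In the question's code this could be used as `new InputStreamReader(is, charsetFromContentType(uc.getContentType(), fallback))`, where `fallback` is whatever charset you expect the site to use when the header says nothing.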

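This will probably fix your problem, but if the Chinese characters are written as character entity references or numeric character references (such as &#12345;), you will still need to do some decoding to get the final text.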
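
As a rough illustration of that decoding step, here is a sketch that converts numeric character references into the characters they stand for. The class `NumericEntityDecoder` and its `decode` method are hypothetical names; the sketch assumes references end with a semicolon, handles only the numeric forms (decimal and hexadecimal), and leaves named entities such as `&amp;` untouched. For real pages, an HTML parsing library is likely a better fit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {

    // Matches hexadecimal (&#x4E2D;) and decimal (&#20013;) numeric references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    /** Replaces each numeric character reference with the character it encodes. */
    static String decode(String html) {
        Matcher m = NUMERIC_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int codePoint = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                // Keep the original text if the number is not a valid code point.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } catch (NumberFormatException e) {
                // Number too large for an int: leave the reference as it was.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // &#20013; and &#x6587; are the two characters of "中文".
        System.out.println(decode("&#20013;&#x6587;"));
    }
}
```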

Unsupported characters (Chinese characters) in a Java String
