BufferedReader: a specific line is not appended to a String

Time: 2017-02-24 23:17:25

Tags: java android html string bufferedreader

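I am using a BufferedReader to read lines from the InputStream of an HttpURLConnection. Everything works as expected, except for one line that is giving me trouble.

The specific line is: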

    <span class="dbox-italic">Chemistry.</span>            <ol class="def-sub-list">

The code I am using is as follows:

        URL searchUrl;
        ArrayList<String> definitions = new ArrayList<>();

        try {
            searchUrl = new URL("http://dictionary.com/browse/" + word);
            HttpURLConnection connection = (HttpURLConnection) searchUrl.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Linux; Android 5.1.1; Vodafone Smart ultra 6"
                    + " Build/LMY47V) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.91"
                    + " Mobile Safari/537.36");

            BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line;

            while((line = in.readLine()) != null){
                if(line.contains("class=\"def-content\"")){
                    line = in.readLine();

                    if(line.contains("</div>")){
                        int endIndex = line.indexOf("</div>") - 11;
                        String definition = line.trim().substring(0, endIndex);
                        definitions.add(definition);
                    } else {
                        if(line.contains("class=\"def-sub-list\"")){
                            String sublist = line;

                            while(!line.contains("</ol>")){
                                line = in.readLine();
                                sublist += line;
                            }

                            sublist = sublist.replaceAll("<li>", "").replaceAll("</ol>", "").trim();

                            String[] sublistDefinitions = sublist.split("</li>");

                            for(String definition : sublistDefinitions){
                                definition.trim();
                                definitions.add(definition);
                            }

                        }

                    }


                }
            }

            System.out.print("");

        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
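A side observation on the loop above: `String` is immutable in Java, so a bare call such as `definition.trim();` returns a new string and leaves `definition` unchanged. Whether this explains the behaviour described here is an assumption, but it produces the same symptom of a string that appears to resist modification. The sketch below keeps the returned value; the class and method names (`SublistCleanup`, `splitDefinitions`) are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the cleanup step; not the original code.
final class SublistCleanup {

    /** Splits a concatenated sub-list on "</li>" and returns trimmed, non-empty entries. */
    static List<String> splitDefinitions(String sublist) {
        // replace() works on literal text, so no regex escaping is needed.
        String cleaned = sublist.replace("<li>", "").replace("</ol>", "");

        List<String> result = new ArrayList<>();
        for (String definition : cleaned.split("</li>")) {
            // trim() returns a NEW string; the result has to be assigned.
            String trimmed = definition.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input, only for illustration.
        System.out.println(splitDefinitions("<li> first </li>  <li>second</li></ol>"));
    }
}
```

The same rule applies to `substring`, `replace` and `replaceAll`: each returns a new string, so the result has to be stored or used.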

The problematic line of code is this one:

    String sublist = line;

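That line of code works for this line of HTML (I had to use a screenshot, because StackOverflow would not let me quote the HTML without stripping the tags, which are useful information):

But not for this line: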

    <span class="dbox-italic">Chemistry.</span>            <ol class="def-sub-list">

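I tried removing the ">" at the end, but it is as if Java refuses to manipulate that line in any way. No matter how many characters I cut off after trimming the whitespace, the line stays exactly as shown above (minus the whitespace). Could it be that the problem line ends with an invisible special character (such as a newline) that Java does not like?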
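One way to test the invisible-character theory is to print the code points of the suspect line. `BufferedReader.readLine()` returns the line without its line terminator, so a trailing newline is unlikely to be the cause. Other characters can survive, though: `String.trim()` only removes characters up to U+0020, so a non-breaking space (U+00A0), for example, stays in place. The sketch below is a minimal diagnostic; `LineInspector` and `describe` are hypothetical names, and the sample strings are made up rather than taken from the real page.

```java
// Hypothetical diagnostic helper; not part of the original code.
final class LineInspector {

    /** Returns every code point of the line as U+XXXX, separated by spaces. */
    static String describe(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: a tag followed by a non-breaking space.
        String suspect = "<ol class=\"def-sub-list\">\u00A0";

        // trim() leaves U+00A0 untouched, so the length does not change.
        System.out.println(suspect.trim().length() == suspect.length()); // true

        System.out.println(describe(">\u00A0")); // U+003E U+00A0
    }
}
```

Calling `describe(line)` on the line read from the connection would show whether anything unexpected follows the final `>`.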

0 answers:

No answers