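I'm using a BufferedReader to read lines from the InputStream of an HttpURLConnection. Everything works as expected, except for one line that is giving me trouble.
The specific line is: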
<span class="dbox-italic">Chemistry.</span> <ol class="def-sub-list">
The code I'm using is:
URL searchUrl;
ArrayList<String> definitions = new ArrayList<>();
try {
    searchUrl = new URL("http://dictionary.com/browse/" + word);
    HttpURLConnection connection = (HttpURLConnection) searchUrl.openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Linux; Android 5.1.1; Vodafone Smart ultra 6"
            + " Build/LMY47V) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.91"
            + " Mobile Safari/537.36");
    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String line;
    while((line = in.readLine()) != null){
        if(line.contains("class=\"def-content\"")){
            line = in.readLine();
            if(line.contains("</div>")){
                int endIndex = line.indexOf("</div>") - 11;
                String definition = line.trim().substring(0, endIndex);
                definitions.add(definition);
            } else {
                if(line.contains("class=\"def-sub-list\"")){
                    String sublist = line;
                    while(!line.contains("</ol>")){
                        line = in.readLine();
                        sublist += line;
                    }
                    sublist = sublist.replaceAll("<li>", "").replaceAll("</ol>", "").trim();
                    String[] sublistDefinitions = sublist.split("</li>");
                    for(String definition : sublistDefinitions){
                        definition.trim();
                        definitions.add(definition);
                    }
                }
            }
        }
    }
    System.out.print("");
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
The problematic line of code is this one:
String sublist = line;
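This line of code works for this line of HTML (I have to use a screenshot because StackOverflow won't let me quote the HTML without stripping the tags, which are useful information):
But not for this line: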
<span class="dbox-italic">Chemistry.</span> <ol class="def-sub-list">
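I tried removing the ">" at the end, but it is as if Java refuses to manipulate that line in any way. No matter how many characters I cut off after trimming the whitespace, the line stays exactly as shown above (minus the whitespace). Could it be that the problem line ends with an invisible special character (such as a newline) that Java doesn't like?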
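One way to test the invisible-character hypothesis would be to print the line with every non-printable or non-ASCII character escaped. The following is only a minimal sketch: `LineInspector` and `reveal` are hypothetical names that are not part of the code above, and the sample string is made up for illustration. Note that `BufferedReader.readLine()` already returns the line without its line terminator, and `String.trim()` only strips characters up to U+0020, so a character such as a non-breaking space (U+00A0) would survive trimming.

```java
public class LineInspector {

    /**
     * Returns a copy of s in which every character outside printable ASCII
     * (0x20..0x7E) is replaced by a visible <U+XXXX> marker.
     * Hypothetical helper, for debugging only.
     */
    public static String reveal(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 0x20 && cp < 0x7F) {
                out.appendCodePoint(cp);                    // printable ASCII: keep as-is
            } else {
                out.append(String.format("<U+%04X>", cp));  // anything else: make it visible
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: an HTML fragment followed by a carriage return
        // and a non-breaking space, neither of which is visible when printed.
        String sample = "<ol class=\"def-sub-list\">\r\u00A0";
        System.out.println(reveal(sample));
    }
}
```

Calling `reveal(line)` just before `String sublist = line;` would show whether the problem line really carries any hidden characters. If nothing unusual appears, the cause is more likely elsewhere, for example a result of `trim()` or `substring()` that is never assigned back to a variable, since Java strings are immutable.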