I have a program that collects some HTML data.
import org.apache.commons.lang.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.FileWriter;
import java.util.Iterator;

public class Uni_Extract {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");
        String csvFile = "C://Users/Kennedy/Desktop/university.csv";
        FileWriter writer = new FileWriter(csvFile);
        for (int i=2; i<=2; i++){
            String url = "http://www.4icu.org/reviews/index"+i+".htm";
            Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
            Elements cells = doc.select("td.i");
            Iterator<Element> iterator = cells.iterator();
            while (iterator.hasNext()) {
                Element cell = iterator.next();
                String university = Jsoup.parse((cell.select("a").text())).text();
                university = StringEscapeUtils.escapeHtml(university);
                String country = cell.nextElementSibling().select("img").attr("alt");
                System.out.printf("country : %s, university : %s %n", country, university);
            }
        }
        writer.flush();
        writer.close();
    }
}
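However, when my program encounters certain special characters, it returns the raw HTML code instead. How can I parse them?
For example, for "Azerbaycan Dövlet Pedaqoji Universiteti", which contains the special character "ö", it returns the name with that character as raw HTML code rather than the character itself. How can I fix this and other similar cases?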
Answer 0 (score: 1)
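After simplifying the code a bit and removing the call to escapeHtml, everything seems to work fine. jsoup's text() already returns decoded characters, and escapeHtml was turning them back into HTML entities. Here is my code and the relevant line of output: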
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.*;
import java.util.*;

public class Test
{
    public static void main(String[] args) throws IOException {
        System.out.println("Started");
        String url = "http://www.4icu.org/reviews/index2.htm";
        Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
        // Each td.i cell holds the university link; its next sibling holds the country flag.
        Elements cells = doc.select("td.i");
        Iterator<Element> iterator = cells.iterator();
        while (iterator.hasNext()) {
            Element cell = iterator.next();
            // text() yields decoded characters, so no escapeHtml call is needed here.
            String university = Jsoup.parse((cell.select("a").text())).text();
            String country = cell.nextElementSibling().select("img").attr("alt");
            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}
Output:
...
country : Azerbaijan, university : Azerbaycan Dövlet Aqrar Universiteti
...
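The question's code opens a FileWriter on university.csv but never writes to it. If the goal is to store these names in that file, the following is a minimal sketch of one way to do it, assuming UTF-8 output is wanted so that characters such as "ö" survive. The class name CsvUtf8Sketch and the helpers quote and writeRows are hypothetical names introduced for illustration, and the temporary file stands in for the real path. It uses only the standard library, so the jsoup loop would have to supply the rows.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvUtf8Sketch {
    // Hypothetical helper: wrap a field in quotes and double any embedded quotes,
    // following the common CSV convention.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Hypothetical helper: write {country, university} pairs as CSV lines.
    // The charset is passed explicitly instead of relying on the platform default.
    static void writeRows(Path csvFile, List<String[]> rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(csvFile, StandardCharsets.UTF_8)) {
            for (String[] row : rows) {
                writer.write(quote(row[0]) + "," + quote(row[1]));
                writer.newLine();
            }
        } // try-with-resources flushes and closes the writer
    }

    public static void main(String[] args) throws IOException {
        // A temporary file is used here; the real program would use its own path.
        Path out = Files.createTempFile("university", ".csv");
        // \u00f6 is "ö", written as an escape so the source compiles under any encoding.
        String[] row = {"Azerbaijan", "Azerbaycan D\u00f6vlet Aqrar Universiteti"};
        writeRows(out, List.of(row));
        System.out.print(Files.readString(out, StandardCharsets.UTF_8));
    }
}
```

In the scraper itself, each country and university pair from the loop could be collected into the list passed to writeRows. Whether the console shows "ö" correctly still depends on the terminal's own encoding, which is separate from the file's.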