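I have a string representing a URL, and I need to fetch its HTML source. The problem is that I cannot find a way to get the correct encoding: letters such as àèìò are not read correctly, and I just get "??" instead.
What is the best way to do this? I have come across many solutions, but apparently none of them work.
Here is my code: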
private String getHtml(String url, String idSession) throws IOException
{
    URL urlToCall = null;
    String html = "";
    try
    {
        urlToCall = new URL(url);
    }
    catch (Exception e)
    {
        e.printStackTrace();
        return "";
    }
    HttpURLConnection conn;
    conn = (HttpURLConnection) urlToCall.openConnection();
    conn.setRequestProperty("cookie", "JSESSIONID=" + idSession);
    conn.setDoOutput(false);
    conn.setReadTimeout(200*1000);
    conn.setConnectTimeout(200*1000);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    InputStream openStream = conn.getInputStream();
    byte[] buffer = new byte[ 1024 ];
    int size = 0;
    while( (size = openStream.read( buffer ) ) != -1 ) {
        output.write( buffer, 0, size );
    }
    html = output.toString("utf-8");
    return html;
}
Answer 0 (score: 0)
Try jsoup:
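String url = "http://www.hamzaalayed.com/";
// Parse the response stream as UTF-8; the URL doubles as the base URI for relative links.
Document document = Jsoup.parse(new URL(url).openStream(), "utf-8", url);
// first() returns null when the page contains no <p> element.
Element paragraph = document.select("p").first();
if (paragraph != null) {
    // Print only the paragraph's direct text nodes, skipping nested elements.
    for (Node node : paragraph.childNodes()) {
        if (node instanceof TextNode) {
            System.out.println(((TextNode) node).text().trim());
        }
    }
}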
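
If adding a library is not an option, one possible cause worth ruling out is that the page is not actually served as UTF-8 (for example ISO-8859-1), so hard-coding "utf-8" when decoding garbles the accented letters. The following is only a minimal sketch under that assumption: it reads the charset from the Content-Type response header and falls back to UTF-8 when none is declared. The class name CharsetSketch and the helpers charsetFromContentType and fetchHtml are hypothetical names chosen for illustration, not part of the original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetSketch {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=ISO-8859-1". Returns the fallback when
    // the header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Hypothetical replacement for getHtml: decodes the body with the charset
    // the server declares instead of always assuming UTF-8.
    static String fetchHtml(String url, String idSession) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Cookie", "JSESSIONID=" + idSession);
        conn.setConnectTimeout(200 * 1000);
        conn.setReadTimeout(200 * 1000);
        // getContentType() connects implicitly and returns null if the header is absent.
        Charset charset = charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int size;
            while ((size = in.read(buffer)) != -1) {
                out.write(buffer, 0, size);
            }
            return new String(out.toByteArray(), charset);
        }
    }
}
```

This sketch does not inspect a charset declared only in an HTML meta tag. If the decoded string is correct but the console still shows "??", the console's own output encoding may be the culprit rather than the download code.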