I am trying to read the source code of a web page. My Java code is:
import java.net.*;
import java.io.*;

class Testing {
    public static void Connect() throws Exception {
        URL url = new URL("http://excite.com/education");
        URLConnection spoof = url.openConnection();
        // Send a browser-like User-Agent header.
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()));
        String strLine = "";
        while ((strLine = in.readLine()) != null) {
            System.out.println(strLine);
        }
        System.out.println("End of page.");
    }

    public static void main(String[] args) {
        try {
            Connect();
        } catch (Exception e) {
            // Report failures instead of silently swallowing them.
            e.printStackTrace();
        }
    }
}
When I compile and run this code, it gives the following output:
(many lines of unreadable binary characters instead of HTML)
Every other URL gives the correct source code. Only URLs under this directory, i.e. "excite.com/education", produce the problem.
Could anyone please help?
Thanks in advance.
Answer 0 (score: 4)
This works for me.
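import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

private static String getWebPabeSource(String sURL) throws IOException {
    URL url = new URL(sURL);
    URLConnection urlCon = url.openConnection();
    BufferedReader in = null;
    // If the server says the body is gzip-compressed, decompress it while reading.
    String encoding = urlCon.getHeaderField("Content-Encoding");
    if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
        in = new BufferedReader(new InputStreamReader(new GZIPInputStream(
                urlCon.getInputStream())));
    } else {
        in = new BufferedReader(new InputStreamReader(
                urlCon.getInputStream()));
    }
    String inputLine;
    StringBuilder sb = new StringBuilder();
    // Note: readLine() strips line terminators, so the result is one long line.
    while ((inputLine = in.readLine()) != null)
        sb.append(inputLine);
    in.close();
    return sb.toString();
}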
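To see why the check above matters without touching the network, here is a minimal sketch that compresses a string in memory and reads it back through GZIPInputStream. The class name GzipDemo and its helper methods are made up for this illustration and are not part of the answer's code. It also checks for the two-byte gzip magic number (0x1f 0x8b), which is one way to recognize compressed bytes when the Content-Encoding header is not available.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: stands in for a gzip-encoded HTTP response body.
class GzipDemo {

    // Compress text into gzip bytes, as a server using Content-Encoding: gzip would.
    public static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress gzip bytes and decode the result as UTF-8 text.
    public static String gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the data starts with the gzip magic number 0x1f 0x8b.
    public static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) {
        String html = "<html><body>Hello</body></html>";
        byte[] compressed = gzip(html);
        // Decoding the compressed bytes directly as text gives garbage like in the question.
        System.out.println(new String(compressed, StandardCharsets.ISO_8859_1));
        System.out.println(looksGzipped(compressed));
        System.out.println(gunzip(compressed));
    }
}
```

The first line printed by main is unreadable, much like the output in the question, because compressed bytes are being decoded as if they were text. The last line is the original HTML again.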
Answer 1 (score: 2)
Try reading it this way:
private static String getUrlSource(String sUrl) throws IOException {
    // The parameter is named sUrl so it does not clash with the local URL variable.
    URL url = new URL(sUrl);
    URLConnection urlConn = url.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(
            urlConn.getInputStream(), "UTF-8"));
    String inputLine;
    StringBuilder a = new StringBuilder();
    while ((inputLine = in.readLine()) != null)
        a.append(inputLine);
    in.close();
    return a.toString();
}
and set the encoding to match the web page. Note the following line:
BufferedReader in = new BufferedReader(new InputStreamReader(
urlConn.getInputStream(), "UTF-8"));
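One way to pick the encoding instead of hard-coding "UTF-8" is to read the charset parameter from the response's Content-Type header. The sketch below is only an illustration under that assumption: the class CharsetFromHeader and the method charsetOf are hypothetical names, and pages that declare their encoding only in a meta tag are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: extracts the charset from a Content-Type header value.
class CharsetFromHeader {

    // Returns the charset named in e.g. "text/html; charset=ISO-8859-1",
    // or the fallback if the header is missing, has no charset, or names an unknown one.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With the code above, the reader could then be built as new InputStreamReader(urlConn.getInputStream(), CharsetFromHeader.charsetOf(urlConn.getContentType(), StandardCharsets.UTF_8)).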
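Answer 2 (score: 0)
First, you have to decompress the content using GZIPInputStream. Then wrap the decompressed stream in an InputStreamReader and read it with a BufferedReader.
The example below uses Apache HTTP Client 4.1.1.
Maven dependency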
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.1.1</version>
</dependency>
Sample code for reading gzip content.
package com.gzip.simple;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class GZIPFetcher {
    public static void main(String[] args) {
        try {
            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpGet getRequest = new HttpGet("http://excite.com/education");
            getRequest.addHeader("accept", "application/json");
            HttpResponse response = httpClient.execute(getRequest);
            if (response.getStatusLine().getStatusCode() != 200) {
                throw new RuntimeException("Failed : HTTP error code : "
                        + response.getStatusLine().getStatusCode());
            }
            InputStream instream = response.getEntity().getContent();
            // Check whether the content-encoding is gzip or not.
            Header contentEncoding = response
                    .getFirstHeader("Content-Encoding");
            if (contentEncoding != null
                    && contentEncoding.getValue().equalsIgnoreCase("gzip")) {
                instream = new GZIPInputStream(instream);
            }
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    instream));
            String content;
            System.out.println("Output from Server .... \n");
            while ((content = in.readLine()) != null)
                System.out.println(content);
            httpClient.getConnectionManager().shutdown();
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}