How to download the exact source code of a web page

Posted: 2016-02-03 04:41:01

Tags: javascript java html css jsoup

I want to download the source code of a web page. I used the URL approach, i.e. URL url = new URL("http://a.html");

and the Jsoup approach, but I do not get the exact data that appears in the actual source. For example -

<input type="image"
       name="ctl00$dtlAlbums$ctl00$imbAlbumImage"    
       id="ctl00_dtlAlbums_ctl00_imbAlbumImage"
       title="Independence Day Celebr..."
       border="0"         
       onmouseover="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','0','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');"
       onmouseout="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','1','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');" 
       src="Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG"     
       alt="Independence Day Celebr..." 
       style="height:79px;width:148px;border-width:0px;"
/>

In this tag, the last attribute "style" is not detected by the Jsoup code. And if I download the page with the URL approach, it changes the style attribute to border="" />.

Can anybody tell me a way to download the exact source code of a web page? My code is -

URL url = new URL("http://www.apcob.org/");
InputStream is = url.openStream();  // throws an IOException
BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
String line;
File fileDir = new File(contextpath + "\\extractedtxt.txt");
Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF-8"));
while ((line = br.readLine()) != null)
{
  // System.out.println("line\n "+line);
  fw.write("\n" + line);
}
fw.close(); // must be closed, or buffered output may never reach the file
br.close();
InputStream in = new FileInputStream(new File(contextpath + "\\extractedtxt.txt")); // original had a stray ';' and a missing path separator here
String baseUrl = "http://www.apcob.org/";
Document doc = Jsoup.parse(in, "UTF-8", baseUrl);
System.out.println(doc);

The second approach I followed is -

Document doc = Jsoup.connect(url_of_currentpage).get();

I want to do this in Java; the problem occurs with the site "http://www.apcob.org/".

5 answers:

Answer 0 (score: 2)

This may be caused by a different user agent string - when you view a page in a browser, the browser sends a user agent string identifying itself. Some sites serve different pages to different browsers (for example, mobile devices). Try sending the same user agent string as your browser.
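As a minimal sketch of this idea with the plain java.net API already used in the question: set the header on the connection before the request is sent. The user agent string below is just a sample desktop-browser string, not anything specific to this site.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentDemo {
    // A typical desktop-browser user agent string (any current browser string works)
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36";

    public static void main(String[] args) throws Exception {
        // openConnection() only creates the connection object; nothing is sent yet
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://www.apcob.org/").openConnection();
        // Send the same User-Agent a real browser would, so the server
        // returns the same markup it serves to browsers
        conn.setRequestProperty("User-Agent", BROWSER_UA);
        // The header is now part of the pending request
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

With Jsoup the same idea is a one-liner: Jsoup.connect(url).userAgent(BROWSER_UA).get().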

Answer 1 (score: 2)

The page you are trying to download is modified in some way by JavaScript code. Jsoup is an HTML parser; it does not run JavaScript.

If you want the source code as you would see it in Chrome, use one of the following tools:

All three of them can parse and run the JavaScript code inside a page.

Answer 2 (score: 1)

I think this should work fine,

public static void main(String[] args) throws Exception {
    //Only If you're using a proxy
    //System.setProperty("java.net.useSystemProxies", "true");

    URL url = new URL("http://www.apcob.org/");

    HttpURLConnection yc = (HttpURLConnection) url.openConnection();
    yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36");
    BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));

    String inputLine;
    while ((inputLine = in.readLine()) != null)
        System.out.println(inputLine);
    in.close();
}

Answer 3 (score: 0)

Here is a handy function for fetching a web page. Use it to get the HTML as a String, then parse that String into a Document with Jsoup.

public static String fetchPage(String urlFullAddress) throws IOException {
//      String proxy = "10.3.100.207";
//      int port = 8080;
        URL url = new URL(urlFullAddress);
//      Proxy proxyConnect = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxy, port));
        HttpURLConnection connection = (HttpURLConnection) url.openConnection(); // or url.openConnection(proxyConnect)
        // Note: do NOT call setDoOutput(true) here - on HttpURLConnection it turns the GET into a POST

        connection.addRequestProperty("User-Agent",
                "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10");
        connection.setReadTimeout(5000); // set timeout

        connection.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        connection.addRequestProperty("Accept-Language", "en-US,en;q=0.5");
        connection.addRequestProperty("Accept-Encoding", "gzip, deflate");
        connection.addRequestProperty("Connection", "keep-alive");
        System.setProperty("http.keepAlive", "true");

        // gzip was requested above, so the response body may arrive compressed
        InputStream stream = connection.getInputStream();
        if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
            stream = new GZIPInputStream(stream);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(stream, "UTF-8"));

        StringBuilder page = new StringBuilder();
        String current;
        while ((current = in.readLine()) != null) {
            page.append(current).append('\n'); // keep line breaks so the source stays readable
        }
        in.close();

        return page.toString();
}

If the problem is with the Jsoup parser, try http://jericho.htmlparser.net/docs/index.html. It parses HTML as-is, without correcting errors.

A couple of other things I noticed: you never close fw, and "UTF8" should be "UTF-8". If you need to parse a lot of CSS, try CSS-Parser.
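The unclosed fw can be avoided entirely with try-with-resources (Java 7+), which flushes and closes the writer even if an exception is thrown. A minimal sketch, using a temporary file instead of the question's contextpath:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterDemo {
    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("extractedtxt", ".txt");
        // try-with-resources closes (and therefore flushes) the writer automatically
        try (BufferedWriter fw = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            fw.write("<html><body>sample line</body></html>");
            fw.newLine();
        }
        // Read the file back to show the content was fully flushed to disk
        System.out.print(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
        Files.deleteIfExists(out);
    }
}
```

Using the StandardCharsets.UTF_8 constant also sidesteps the "UTF8" vs "UTF-8" charset-name issue, since no string lookup is involved.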

Answer 4 (score: 0)

When a web page is fetched over HTTP, the web server usually formats it in some way; you cannot get the exact content of a file over HTTP with PHP. As far as I know, the only way to do what is asked is to use FTP.