Question

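I'm trying to save the page source from Amazon so that I can check the price of an item. When I save it to a file, only about 60 lines are written, and most of them are blank. I can see the source in my browser, and it is thousands of lines long. My code works for every other page I have tried. Here is the link I've been trying: http://www.amazon.com/gp/product/B015WCV70W/ref=s9_simh_gw_g147_i2_r?ie=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-2&pf_rd_r=0XHXJAF2NQ35BP5Y435K&pf_rd_t=36701&pf_rd_p=dc68ddd1-99ac-45e5-8c23-e9e0811a2b2c&pf_rd_i=desktop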

Is there an easier way to do this?

Here is my code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;


public class DownloadPage {

    public static final Scanner in = new Scanner(System.in);

    public static void main(String[] args) throws IOException {

        System.out.print("Enter URL: ");
        savePage(in.nextLine());

    }

    static void savePage(String entURL) throws IOException{
        URL url = new URL(entURL);
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();

        BufferedWriter bw = new BufferedWriter(new FileWriter("text.txt"));
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line = null;
        int count = 0;
        while (br.ready()) {
            bw.write(br.readLine());
            bw.newLine();
            count++;
        }
        line = null;
        bw.close();
        System.out.println("wrote successfully " + count);
    }
}

Sorry if I didn't format this properly; this is my first post.

Answer 1

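The URL is just the loading point for a JavaScript application that renders the HTML in your browser.

If you want to capture the rendered page, try Selenium/WebDriver, which emulates a browser (and runs the JavaScript application).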

Answer 2

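This happens because you are using br.ready(), so any pause in the network ends the loop early. Reading until readLine() returns null instead gave me 20632 lines of HTML: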

int count = 0;
while (true) {
    String line = br.readLine();
    if (line == null) {
        break; // end of stream reached
    }
    bw.write(line); // write the line that was just read
    bw.newLine();
    count++;
}
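
For reference, here is a minimal self-contained sketch of the same idea: read until readLine() returns null rather than polling ready(). The class name SavePage, the helper copyLines, and the choice of UTF-8 are my own assumptions for illustration, not something taken from the question, and the page may well use a different encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SavePage {

    // Copies every line from 'in' to 'out' and returns the number of lines copied.
    // The loop ends only when readLine() returns null (end of stream), so a
    // temporary network pause does not cut the copy short.
    static int copyLines(Reader in, Writer out) throws IOException {
        BufferedReader br = new BufferedReader(in);
        BufferedWriter bw = new BufferedWriter(out);
        int count = 0;
        String line;
        while ((line = br.readLine()) != null) {
            bw.write(line);
            bw.newLine();
            count++;
        }
        bw.flush(); // push buffered text through to the underlying writer
        return count;
    }

    // Downloads pageUrl and saves it to fileName.
    // Assumption: the response is decoded as UTF-8.
    static int savePage(String pageUrl, String fileName) throws IOException {
        try (Reader in = new InputStreamReader(
                     new URL(pageUrl).openStream(), StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(
                     Paths.get(fileName), StandardCharsets.UTF_8)) {
            return copyLines(in, out);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java SavePage <url>");
            return;
        }
        int count = savePage(args[0], "text.txt");
        System.out.println("wrote " + count + " lines");
    }
}
```

The try-with-resources block closes both the network stream and the file even if an exception is thrown partway through, which the original savePage did not do.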

Saving page source from Amazon using Java