When I fetch HTML with BufferedReader, the part I need isn't there

Time: 2015-11-07 19:40:24

Tags: java html

So I have code like this to get a value from a tag on a site:

    try {

        URL url = new URL("google.com");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

        String inputLine;
        while (in.readLine() != null) {

            inputLine = in.readLine();
        }
        in.close();


    } catch (IOException e) {

        e.printStackTrace();

    }

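Say I need it to find "Pizza", but only some of the code shows up, so I can't get to that part. Is there a way I can print out the WHOLE HTML (using BufferedReader and without extra imports like Jsoup), and then search through it?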

1 Answer:

Answer 0: (score: 1)

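    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.URL;
    import java.net.URLConnection;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    // Open the page; the URL must include the protocol.
    URL url = new URL("http://www.google.com");
    URLConnection uc = url.openConnection();

    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;

    // Save the whole response to a local file named "orhancan".
    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);

    // Call readLine() once per iteration so that no line is skipped.
    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }

    in.close();
    out.close();

    // Parse the saved file. DocumentBuilder is an XML parser, so this step
    // throws a SAXException if the page is not well-formed XML.
    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);

    // Count the <body> elements found in the document.
    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());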

There is an easier way to do this. I suggest using JSoup. With JSoup you can do the following:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

    Elements body = doc.select("body");

Or if you want all the links:

    Elements links = doc.select("body a");
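
For the constraint stated in the question (BufferedReader only, no Jsoup), the following is a minimal sketch of reading the whole page into one string and then searching it. In the question's loop, readLine() is called twice per iteration, so every other line is discarded; the sketch calls it once per iteration. The class name PageSearch and the helper names readAll and fetch are made up for illustration, and UTF-8 is assumed as the page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageSearch {

    // Reads every line from the reader into a single string.
    // readLine() is called exactly once per iteration, so no line is lost.
    static String readAll(Reader source) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    // Downloads the page at the given address. The protocol ("http://")
    // is required; "google.com" alone raises a MalformedURLException.
    // UTF-8 is an assumption here, not something the server guarantees.
    static String fetch(String address) throws IOException {
        URL url = new URL(address);
        return readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String html = fetch("http://www.google.com");
        System.out.println(html);                       // the whole HTML
        System.out.println("Contains \"Pizza\": " + html.contains("Pizza"));
    }
}
```

Keeping readAll separate from fetch means the reading logic can be tried on any Reader, for example a StringReader, without a network connection. Note that this only returns the HTML the server sends; anything a page adds later through JavaScript will not appear in the string.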