Question

我需要创建web scraper实用程序，它通过URL获取Web资源。然后计算网页上提供的单词的出现次数和字符数。

URL url = new URL(urlStr);
URLConnection connection = url.openConnection();
InputStream inputStream = connection.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream,"UTF-8"));

有了这个，我可以获取页面上的所有文本（和html标签），以便我接下来做什么？

有人可以帮我吗？一些doc或sthg要阅读。我只需要使用JavaSE。不能使用3d派对库。

Answer 1

例如，您有page.html：

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" />
                Password : <input id="password" type="password" />
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>

要解析它，您可以：

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
* Java Program to parse/read HTML documents from File using Jsoup library.
*/
public class HTMLParser{

    public static void main(String args[]) {

        // Parse HTML String using JSoup library
        String HTMLSTring = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(HTMLSTring);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + HTMLSTring);
        System.out.println("After parsing, Title : " + title);
        System.out.println("Afte parsing, Heading : " + h1);

        // JSoup Example 2 - Reading HTML page from URL
        Document doc;
        try {
            doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println("Jsoup Can read HTML page from URL, title : " + title);

        // JSoup Example 3 - Parsing an HTML file in Java
        //Document htmlFile = Jsoup.parse("login.html", "ISO-8859-1"); // wrong
        Document htmlFile = null;
        try {
            htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } // right
        title = htmlFile.title();
        Element div = htmlFile.getElementById("login");
        String cssClass = div.className(); // getting class form HTML element

        System.out.println("Jsoup can also parse HTML file directly");
        System.out.println("title : " + title);
        System.out.println("class of div tag : " + cssClass);
    }
}

输出：

Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
Afte parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple

解析HTML（网页）JavaSE

1 个答案: