我需要创建web scraper实用程序,它通过URL获取Web资源。然后计算网页上提供的单词的出现次数和字符数。
URL url = new URL(urlStr);
URLConnection connection = url.openConnection();
InputStream inputStream = connection.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream,"UTF-8"));
有了这个,我可以获取页面上的所有文本(和html标签),以便我接下来做什么?
有人可以帮我吗?一些doc或sthg要阅读。我只需要使用JavaSE。不能使用3d派对库。
答案 0 :(得分:0)
例如,您有page.html:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Login Page</title>
</head>
<body>
<div id="login" class="simple" >
<form action="login.do">
Username : <input id="username" type="text" />
Password : <input id="password" type="password" />
<input id="submit" type="submit" />
<input id="reset" type="reset" />
</form>
</div>
</body>
</html>
要解析它,您可以:
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
/**
* Java Program to parse/read HTML documents from File using Jsoup library.
*/
public class HTMLParser{
public static void main(String args[]) {
// Parse HTML String using JSoup library
String HTMLSTring = "<!DOCTYPE html>"
+ "<html>"
+ "<head>"
+ "<title>JSoup Example</title>"
+ "</head>"
+ "<body>"
+ "<table><tr><td><h1>HelloWorld</h1></tr>"
+ "</table>"
+ "</body>"
+ "</html>";
Document html = Jsoup.parse(HTMLSTring);
String title = html.title();
String h1 = html.body().getElementsByTag("h1").text();
System.out.println("Input HTML String to JSoup :" + HTMLSTring);
System.out.println("After parsing, Title : " + title);
System.out.println("Afte parsing, Heading : " + h1);
// JSoup Example 2 - Reading HTML page from URL
Document doc;
try {
doc = Jsoup.connect("http://google.com/").get();
title = doc.title();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("Jsoup Can read HTML page from URL, title : " + title);
// JSoup Example 3 - Parsing an HTML file in Java
//Document htmlFile = Jsoup.parse("login.html", "ISO-8859-1"); // wrong
Document htmlFile = null;
try {
htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} // right
title = htmlFile.title();
Element div = htmlFile.getElementById("login");
String cssClass = div.className(); // getting class form HTML element
System.out.println("Jsoup can also parse HTML file directly");
System.out.println("title : " + title);
System.out.println("class of div tag : " + cssClass);
}
}
输出:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
Afte parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple