我的java代码是:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class celebGrepper {
static class CelebData {
URL link;
String name;
CelebData(URL link, String name) {
this.link=link;
this.name=name;
}
}
public static String grepper(String url) {
URL source;
String data = null;
try {
source = new URL(url);
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.connect();
InputStream is = connection.getInputStream();
/**
* Attempting to fetch an entire line at a time instead of just a character each time!
*/
StringBuilder str = new StringBuilder();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
while((data = br.readLine()) != null)
str.append(data);
data=str.toString();
} catch (IOException e) {
e.printStackTrace();
}
return data;
}
public static ArrayList<CelebData> parser(String html) throws MalformedURLException {
ArrayList<CelebData> list = new ArrayList<CelebData>();
Pattern p = Pattern.compile("<td class=\"image\".*<img src=\"(.*?)\"[\\s\\S]*<td class=\"name\"><a.*?>([\\w\\s]+)<\\/a>");
Matcher m = p.matcher(html);
while(m.find()) {
CelebData current = new CelebData(new URL(m.group(1)),m.group(2));
list.add(current);
}
return list;
}
public static void main(String... args) throws MalformedURLException {
String html = grepper("https://www.forbes.com/celebrities/list/");
System.out.println("RAW Input: "+html);
System.out.println("Start Grepping...");
ArrayList<CelebData> celebList = parser(html);
for(CelebData item: celebList) {
System.out.println("Name:\t\t "+item.name);
System.out.println("Image URL:\t "+item.link+"\n");
}
System.out.println("Grepping Done!");
}
}
它应该获取https://www.forbes.com/celebrities/list/
的整个HTML内容。但是,当我将下面的实际结果与原始页面进行比较时,我发现缺少所需的整个表格!是因为当我开始通过输入流从页面获取字节时页面没有完全加载?请帮我理解。
页面输出:
https://jsfiddle.net/e0771aLz/
如果只提取图像链接和明星的名字,我该怎么办?
我知道尝试使用正则表达式解析HTML是一种非常糟糕的做法,并且是噩梦的东西,但是在Android的某个视频培训课程中,这正是这个人所做的,我只是想跟进,因为它只是在这一课中。