My HTML fetcher program in Java returns incomplete results

Asked: 2017-10-23 04:15:46

Tags: java android html regex

My Java code is:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class celebGrepper {

    static class CelebData {
        URL link;
        String name;

        CelebData(URL link, String name) {
            this.link=link;
            this.name=name;
        }
    }

    public static String grepper(String url) {
        URL source;
        String data = null;

        try {
            source = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) source.openConnection();
            connection.connect();

            InputStream is = connection.getInputStream();

            /**
             * Attempting to fetch an entire line at a time instead of just a character each time!
             */
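            StringBuilder str = new StringBuilder();

            // try-with-resources closes the reader (and the underlying stream) when done
            try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
                // readLine() strips line terminators, so the page ends up on one long line
                while ((data = br.readLine()) != null) {
                    str.append(data);
                }
            }

            data = str.toString();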

        } catch (IOException e) {
            e.printStackTrace();
        }

        return data;
    }

    public static ArrayList<CelebData> parser(String html) throws MalformedURLException {
        ArrayList<CelebData> list = new ArrayList<CelebData>();

        Pattern p = Pattern.compile("<td class=\"image\".*<img src=\"(.*?)\"[\\s\\S]*<td class=\"name\"><a.*?>([\\w\\s]+)<\\/a>");
        Matcher m = p.matcher(html);

        while(m.find()) {
            CelebData current = new CelebData(new URL(m.group(1)),m.group(2));
            list.add(current);
        }

        return list;
    }

    public static void main(String... args) throws MalformedURLException {
        String html = grepper("https://www.forbes.com/celebrities/list/");
        System.out.println("RAW Input: "+html);
        System.out.println("Start Grepping...");
        ArrayList<CelebData> celebList = parser(html);
        for(CelebData item: celebList) {
            System.out.println("Name:\t\t "+item.name);
            System.out.println("Image URL:\t "+item.link+"\n");
        }
        System.out.println("Grepping Done!");
    }

}

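It is supposed to fetch the entire HTML content of https://www.forbes.com/celebrities/list/. But when I compare the actual result below with the original page, the whole table I need is missing! Is it because the page had not finished loading when I started reading bytes from it through the input stream? Please help me understand.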

Page output:

https://jsfiddle.net/e0771aLz/
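
One way to narrow this down is to check whether the markers the regex in `parser` depends on appear in the fetched string at all. The loop in `grepper` reads until the end of the stream, so a half-read response is unlikely to be the cause. If the markers are absent, a plausible explanation (a guess, not confirmed for this page) is that the table is built by JavaScript in the browser, which `HttpURLConnection` never executes. The sketch below uses hypothetical names (`MarkupCheck`, `hasCelebRows`) and made-up sample strings, and it assumes the cells carry `class="image"` and `class="name"` as the regex expects.

```java
class MarkupCheck {

    // Hypothetical helper: true only if both cell markers that the regex in
    // parser() relies on are present in the fetched HTML string.
    static boolean hasCelebRows(String html) {
        if (html == null) {
            return false; // grepper() can return null when the request fails
        }
        return html.contains("<td class=\"image\"")
                && html.contains("<td class=\"name\"");
    }

    public static void main(String[] args) {
        // Made-up samples: one with the assumed table cells, one without.
        String withTable = "<tr><td class=\"image\"><img src=\"a.jpg\"></td>"
                + "<td class=\"name\"><a href=\"#\">Some Name</a></td></tr>";
        String withoutTable = "<div id=\"list\"></div><script src=\"app.js\"></script>";

        System.out.println(hasCelebRows(withTable));
        System.out.println(hasCelebRows(withoutTable));
    }
}
```

Calling `MarkupCheck.hasCelebRows(html)` on the string returned by `grepper` before running `parser` would show whether the regex ever had anything to match.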

What should I do if I only want to extract the image links and the celebrities' names?

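I know that trying to parse HTML with regular expressions is very bad practice and the stuff of nightmares, but that is exactly what the instructor did in a video training course on Android, and I just want to follow along since it is only for this one lesson.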
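
Staying with the regex approach from the course, here is a minimal sketch of pulling one image URL and one name per row. It assumes a row layout that has not been verified against the real page: a `<td class="image">` cell containing an `<img src="...">`, followed by a `<td class="name">` cell containing a link. The class name `RowRegexSketch`, the method `extract`, and the sample HTML are all hypothetical. The main change from the pattern in `parser` is that the gaps use reluctant quantifiers (`.*?`) instead of greedy ones (`.*` and `[\s\S]*`), since greedy gaps tend to swallow many rows into a single match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowRegexSketch {

    // Assumed row layout (not checked against the real page):
    // <td class="image">...<img src="URL">... <td class="name"><a ...>NAME</a>
    // DOTALL lets '.' cross line breaks; '.*?' stops at the first possible match.
    private static final Pattern ROW = Pattern.compile(
            "<td class=\"image\".*?<img src=\"([^\"]*)\".*?"
                    + "<td class=\"name\"><a[^>]*>([\\w\\s]+)</a>",
            Pattern.DOTALL);

    // Returns one {imageUrl, name} pair per matched row.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample with two rows in the assumed layout.
        String sample =
                "<tr><td class=\"image\"><img src=\"https://example.com/a.jpg\"></td>"
              + "<td class=\"name\"><a href=\"/a\">Alice Example</a></td></tr>"
              + "<tr><td class=\"image\"><img src=\"https://example.com/b.jpg\"></td>"
              + "<td class=\"name\"><a href=\"/b\">Bob Sample</a></td></tr>";

        for (String[] row : extract(sample)) {
            System.out.println(row[1] + " -> " + row[0]);
        }
    }
}
```

This only helps if the fetched HTML actually contains such rows. If the table is missing from the response, no pattern will find it.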

0 answers:

No answers yet.