How to get HtmlElements from a website

Asked: 2014-03-05 03:14:27

Tags: java jsoup

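I am trying to get the URLs and HTML elements from a website. I can get the URLs and the HTML, but when a page contains several elements of the same kind (for example multiple input elements or multiple textarea elements), I only get the last element. The code is below.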

GetURLsAndElemens.java

public static void main(String[] args) throws FileNotFoundException,
                IOException, ParseException {

            Properties properties = new Properties();
            properties
                    .load(new FileInputStream(
                            "src//io//servicely//ci//plugin//SeleniumResources.properties"));
            Map<String, String> urls = gettingUrls(properties
                    .getProperty("MAIN_URL"));
            GettingHTMLElements.getHTMLElements(urls);
            // System.out.println(urls.size());
            // System.out.println(urls);
        }

        public static Map<String, String> gettingUrls(String mainURL) {
            Document doc = null;
            Map<String, String> urlsList = new HashMap<String, String>();
            try {
                System.out.println("Main URL " + mainURL);

                // need http protocol
                doc = Jsoup.connect(mainURL).get();
                GettingHTMLElements.getInputElements(doc, mainURL);

                // get page title
                // String title = doc.title();
                // System.out.println("title : " + title);

                // get all links
                Elements links = doc.select("a[href]");
                for (Element link : links) {
                    // urlsList.clear();

                    // get the value from href attribute and adding to list
                    if (link.attr("href").contains("http")) {
                        urlsList.put(link.attr("href"), link.text());

                    } else {
                        urlsList.put(mainURL + link.attr("href"), link.text());

                    }

                    // System.out.println(urlsList);
                }

            } catch (IOException e) {
                e.printStackTrace();
            }
            // System.out.println("Total urls are "+urlsList.size());
            // System.out.println(urlsList);
            return urlsList;
        }

GettingHtmlElements.java

static Map<String, HtmlElements> urlList = new HashMap<String, HtmlElements>();

    public static void getHTMLElements(Map<String, String> urls)
            throws IOException {

        getElements(urls);

    }

    public static void getElements(Map<String, String> urls) throws IOException {

        for (Map.Entry<String, String> entry1 : urls.entrySet()) {

            try {

                System.out.println(entry1.getKey());

                Document doc = Jsoup.connect(entry1.getKey()).get();

                getInputElements(doc, entry1.getKey());

            }

            catch (Exception e) {
                e.printStackTrace();
            }

        }

        Map<String,HtmlElements> list = urlList;
        for(Map.Entry<String,HtmlElements> entry1:list.entrySet())
        {
            HtmlElements ele = entry1.getValue();
            System.out.println("url is "+entry1.getKey());
            System.out.println("input name "+ele.getInput_name());
        }
    }

    public static HtmlElements getInputElements(Document doc, String entry1) {

        HtmlElements htmlElements = new HtmlElements();
        Elements inputElements2 = doc.getElementsByTag("input");
        Elements textAreaElements2 = doc.getElementsByTag("textarea");
        Elements formElements3 = doc.getElementsByTag("form");

        for (Element inputElement : inputElements2) {
            String key = inputElement.attr("name");
            htmlElements.setInput_name(key);
            String key1 = inputElement.attr("type");
            htmlElements.setInput_type(key1);
            String key2 = inputElement.attr("class");
            htmlElements.setInput_class(key2);

        }
        for (Element inputElement : textAreaElements2) {
            String key = inputElement.attr("id");
            htmlElements.setTextarea_id(key);
            String key1 = inputElement.attr("name");
            htmlElements.setTextarea_name(key1);

                    }
        for (Element inputElement : formElements3) {
            String key = inputElement.attr("method");
            htmlElements.setForm_method(key);
            String key1 = inputElement.attr("action");
            htmlElements.setForm_action(key1);


        }

        return urlList.put(entry1, htmlElements);

    }

I want those elements as beans. For each URL I get the URL and its HTML elements, but when a page contains multiple elements, I only get the last one.

1 Answer:

Answer 0 (score: 0)

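As far as I can tell, the HtmlElements class you are using is not part of jsoup. I don't know how it works internally, but I assume it is some kind of list of HTML nodes or something similar.

However, you appear to use the class like this: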

HtmlElements htmlElements = new HtmlElements();
htmlElements.setInput_name(key);

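This suggests that only ONE HTML element is ever stored in the htmlElements variable. That would explain why only the last element is kept: each loop iteration calls the setters on the same instance and overwrites the previous values.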
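The overwrite can be reproduced without jsoup or any network access. The following is a minimal sketch, assuming the bean has a single field and a plain setter; SingleValueBean and OverwriteDemo are hypothetical names standing in for the asker's HtmlElements class, whose source is not shown.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the asker's HtmlElements bean: one field, one setter.
class SingleValueBean {
    private String inputName;

    void setInputName(String inputName) { this.inputName = inputName; }

    String getInputName() { return inputName; }
}

class OverwriteDemo {
    // Mirrors the loop shape in getInputElements: one bean is created
    // before the loop and reused on every pass.
    static SingleValueBean fill(List<String> names) {
        SingleValueBean bean = new SingleValueBean();
        for (String name : names) {
            bean.setInputName(name); // replaces whatever the previous pass stored
        }
        return bean;
    }

    public static void main(String[] args) {
        SingleValueBean bean = fill(Arrays.asList("username", "password", "email"));
        System.out.println(bean.getInputName()); // prints "email"
    }
}
```

Only the value from the final iteration survives, which matches the behaviour described in the question.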

This is not entirely certain, because I don't know the HtmlElements class. Assuming HtmlElement represents a single entry of HtmlElements, and HtmlElements has an add method:

HtmlElements htmlElements = new HtmlElements();
...
for (Element inputElement : inputElements2) {
  HtmlElement e = new HtmlElement();
  htmlElements.add(e);
  String key = inputElement.attr("name");
  e.setInput_name(key);
}
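If HtmlElements cannot be changed to offer an add method, the same idea can be expressed with standard collections: one small bean per element, collected in a list per URL. This is only a sketch under that assumption. InputField and PageElements are hypothetical names, and in the real crawler the constructor arguments would come from inputElement.attr("name") and inputElement.attr("type").

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bean describing ONE <input> element.
class InputField {
    private final String name;
    private final String type;

    InputField(String name, String type) {
        this.name = name;
        this.type = type;
    }

    String getName() { return name; }

    String getType() { return type; }
}

class PageElements {
    // URL -> every input found on that page, in insertion order.
    private final Map<String, List<InputField>> inputsByUrl = new LinkedHashMap<>();

    // Appends to the list for this URL instead of replacing a single bean.
    void addInput(String url, InputField field) {
        inputsByUrl.computeIfAbsent(url, k -> new ArrayList<>()).add(field);
    }

    // Returns an empty list for URLs that have no recorded inputs.
    List<InputField> inputsFor(String url) {
        return inputsByUrl.getOrDefault(url, new ArrayList<>());
    }

    public static void main(String[] args) {
        PageElements pages = new PageElements();
        // Placeholder values; a crawler would read these from each Element.
        pages.addInput("http://example.com/login", new InputField("username", "text"));
        pages.addInput("http://example.com/login", new InputField("password", "password"));

        for (InputField f : pages.inputsFor("http://example.com/login")) {
            System.out.println(f.getName() + " : " + f.getType());
        }
    }
}
```

The same pattern would apply to textarea and form elements, each with its own bean type and list.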