Jsoup: How to get the ID and href from many elements

Time: 2016-10-22 08:46:45

Tags: java html web-crawler jsoup html-parser

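I need to get the ID and href of all the elements (the ones in the colored boxes in the picture). I don't know how exactly to specify the path and extract the information I need. How can I do this?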

(Image: HTML Structure)

2 answers:

Answer 0 (score: 0)

Select by ID and by tag until you reach the relevant elements, then read the values by attribute. See the code snippet below:

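// html is a String holding the page source; Jsoup.parse(String) parses markup, not a file path
Document doc = Jsoup.parse(html);

// Start from the container element that wraps the search results
Element container = doc.getElementById("search_result_container");
// getElementsByTag also matches the element it is called on, so if the
// container is itself a <div>, index 0 is the container and index 1 is its first nested <div>
Elements divs = container.getElementsByTag("div");
Element secondDiv = divs.get(1);
Elements hyperLinks = secondDiv.getElementsByTag("a");

for (Element alink : hyperLinks) {
    // attr() returns an empty string when the attribute is missing
    String href = alink.attr("href");
    String id = alink.attr("id");
}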
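
As an alternative to walking the tree step by step, the same lookup can probably be expressed as a single CSS selector passed to jsoup's select. The following is only a minimal sketch under assumptions: the class name LinkExtractor, the helper extractLinks, and the sample HTML are made up for illustration, and it assumes the links sit somewhere below the element with the id search_result_container and carry both an id and an href attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Hypothetical helper (not part of jsoup): maps each link's id to its href.
    static Map<String, String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>();
        // Matches every <a> that has both an id and an href attribute and
        // lies anywhere below the element with id "search_result_container".
        for (Element a : doc.select("#search_result_container a[id][href]")) {
            links.put(a.attr("id"), a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup, standing in for the real page.
        String html = "<div id=\"search_result_container\">"
                + "<div><a id=\"first\" href=\"/app/1\">One</a></div>"
                + "<div><a id=\"second\" href=\"/app/2\">Two</a></div>"
                + "</div>";
        extractLinks(html).forEach((id, href) -> System.out.println(id + " -> " + href));
    }
}
```

A selector keeps the code independent of how many nested div elements surround the links, which is the part of the index-based approach most likely to break when the page layout changes.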

Answer 1 (score: 0)

OK, I got it. It works!! Thanks SUNNYben, you gave me the right input!!!

Here is my solution code:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Steam_GameID_Links
{
    public static void main(String[] args)
    {
        Steam_GameID_Links wc = new Steam_GameID_Links();
        try
        {
            String url = "http://store.steampowered.com/search/?sort_by=_ASC&category1=998&page=1";
            Document document = Jsoup.connect(url).get();
            // read the number of result pages from the pagination element
            Elements howMuchPages = document.select(".search_pagination_right");
            String[] stuff = howMuchPages.text().split(" ");
            String tmp = stuff[4].replace(" ", "").replace(".", "");
            StringBuilder sb = new StringBuilder();
            for(int i = 0; i < tmp.length(); i++)
            {
                if(Character.isDigit(tmp.charAt(i)))
                {
                    sb.append(tmp.charAt(i));
                }
            }
            String last = sb.toString().trim();
            int lastPages = Integer.parseInt(last);
            int counter = 0;
            for(int i = 1; i < lastPages + 1; i++)
            {
                url = "http://store.steampowered.com/search/?sort_by=_ASC&category1=998&page=" + i;
                document = Jsoup.connect(url).get();
                // first select the parent node: <div id="search_result_container">
                Element parentNode = document.getElementById("search_result_container");
                Elements childNodes = parentNode.getElementsByAttribute("data-ds-appid");
                for(Element alink : childNodes)
                {
                    String href = alink.attr("href");
                    String id = alink.attr("data-ds-appid");
                    String name = alink.getElementsByClass("title").text();
                    System.out.println("Spiel: " + name + ", ID: " + id + ", SpieleLink: " + href);
                    // wc.writeSpielNameIDLink("Spiel: " + name + ", ID: " + id + ", SpieleLink: " + href + "\n");
                }
            }
        }
        catch(IOException e)
        {
            e.printStackTrace();
        }
    }