Java Jsoup Google Image Search result parsing

Time: 2016-10-20 19:00:19

Tags: java jsoup

I'm using jsoup to parse Google image results. I'm trying to get the src of the image. Here is my code so far. The output is truncated for some reason and I can't access the src attribute. Does anyone know why this is happening and what I can do to fix it? Thanks so much!

// Imports needed by this snippet
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    try {
        // Does a google image search for "test"
        final Document doc = Jsoup.connect("https://www.google.com/search?q=test&tbm=isch").userAgent(USER_AGENT).get();

        // selects images
        Elements elements = doc.select("img.rg_ic.rg_i");

        // cycles through elements and prints each one
        for (Element e : elements) {
            System.out.print(e);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Output:

<img class="rg_ic rg_i" data-sz="f" name="XWXPqrX1RFJiaM:" alt="Image result for test" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)">

1 Answer:

Answer 0 (score: 2)

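The following code uses jsoup to get the URLs of the first 100 image results. If you need all of the results, you will have to use a headless browser (I suggest PhantomJS; see this answer for details).

The static HTML source only contains the image URLs of the first 100 results, and they are stored in JSON objects. To parse the scraped JSON objects I used JSON.simple.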

The JSON objects are contained in <div> elements with the class rg_meta and have the following format:

{"st":"Uber","tu":"https:\/\/encrypted-tbn3.gstatic.com\/images?q=tbn:ANd9GcTSEUMluu1kigjR3JU40BYfaH0fQ6JW1vk9WScBiXr--lsMILf2","ru":"https:\/\/newsroom.uber.com\/uberkittens-are-back\/","tw":300,"pt":"UberKittens Delivers Kittens to Play or Stay","ou":"https:\/\/newsroom.uber.com\/wp-content\/uploads\/2015\/10\/HQ_uberkittens_blog_960x540_r1v1.jpg","ow":960,"cl":6,"isu":"newsroom.uber.com","rid":"vLA3QXY8xPE4PM","cr":3,"ity":"jpg","sc":1,"ct":15,"s":"Clear Your Calendars\u2014#UberKITTENS Are Back","th":168,"oh":540,"id":"qCR7qXt7VX38iM:","itg":false,"cb":15}

So, for the URL, we need to extract the value of the key "ou".

Example code

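// Imports needed by this snippet
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// can only grab first 100 results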
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=kittens&gws_rd=cr";

List<String> resultUrls = new ArrayList<String>();

try {
    Document doc = Jsoup.connect(url).userAgent(userAgent).referrer("https://www.google.com/").get();

    Elements elements = doc.select("div.rg_meta");

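    JSONParser parser = new JSONParser();
    for (Element element : elements) {
        // text() returns the unescaped text of the div; childNode(0).toString()
        // would HTML-escape characters such as & and corrupt URLs with query strings
        String json = element.text();
        if (!json.isEmpty()) {
            JSONObject jsonObject = (JSONObject) parser.parse(json);
            resultUrls.add((String) jsonObject.get("ou"));
        }
    }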

    System.out.println("number of results: " + resultUrls.size());

    for (String imageUrl : resultUrls) {
        System.out.println(imageUrl);
    }

} catch (IOException | ParseException e) {
    e.printStackTrace();
}
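
For a quick check without adding JSON.simple, the "ou" value can also be pulled out of a single rg_meta payload with a regular expression. This is only a rough sketch and not a replacement for a real JSON parser: the class name OuExtractor and the method extractOu are made up for illustration, and it assumes the payload is a flat object like the one shown above, where the text "ou": does not also appear inside another string value.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or JSON.simple.
class OuExtractor {
    // Matches "ou":"<value>", where <value> may contain backslash escapes
    // but no unescaped double quote.
    private static final Pattern OU =
            Pattern.compile("\"ou\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    static Optional<String> extractOu(String json) {
        Matcher m = OU.matcher(json);
        if (!m.find()) {
            return Optional.empty();
        }
        // The payload escapes "/" as "\/"; undo only that escape.
        // Other escapes (for example \u2014) are left untouched.
        return Optional.of(m.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Shortened version of the rg_meta payload shown above.
        String sample = "{\"tw\":300,\"ou\":\"https:\\/\\/newsroom.uber.com\\/wp-content"
                + "\\/uploads\\/2015\\/10\\/HQ_uberkittens_blog_960x540_r1v1.jpg\",\"ow\":960}";
        System.out.println(extractOu(sample).orElse("(no ou key)"));
    }
}
```

In the loop above you would pass element.text() to extractOu instead of the hard-coded sample. For anything beyond a quick experiment, a proper JSON parser remains the safer choice, since it handles every escape sequence and nested values.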