I'm using jsoup to parse Google image search results, and I'm trying to get the src
of each image. Here is my code so far. The output seems to be truncated: the printed img elements have no src
attribute, so I can't read it. Does anyone know why this happens and what I can do to fix it? Thanks so much!
public static void main(String[] args) {
    try {
        // Does a Google image search for "test"
        final Document doc = Jsoup.connect("https://www.google.com/search?q=test&tbm=isch").userAgent(USER_AGENT).get();
        // Selects the images
        Elements elements = doc.select("img.rg_ic.rg_i");
        // Cycles through the elements and prints each one
        for (Element e : elements) {
            System.out.print(e);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output:
<img class="rg_ic rg_i" data-sz="f" name="XWXPqrX1RFJiaM:" alt="Image result for test" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)">
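Answer 0 (score: 2)
The following code uses jsoup to get the URLs of the first 100 image results. If you need all of the results, you will have to use a headless browser (I suggest PhantomJS; see this answer for more information).
The static HTML source contains the image URLs of only the first 100 results, stored in JSON objects. To parse the scraped JSON objects, I used JSON.simple.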
The JSON objects are contained in <div> elements with the class rg_meta and have the following format:
{"st":"Uber","tu":"https:\/\/encrypted-tbn3.gstatic.com\/images?q=tbn:ANd9GcTSEUMluu1kigjR3JU40BYfaH0fQ6JW1vk9WScBiXr--lsMILf2","ru":"https:\/\/newsroom.uber.com\/uberkittens-are-back\/","tw":300,"pt":"UberKittens Delivers Kittens to Play or Stay","ou":"https:\/\/newsroom.uber.com\/wp-content\/uploads\/2015\/10\/HQ_uberkittens_blog_960x540_r1v1.jpg","ow":960,"cl":6,"isu":"newsroom.uber.com","rid":"vLA3QXY8xPE4PM","cr":3,"ity":"jpg","sc":1,"ct":15,"s":"Clear Your Calendars\u2014#UberKITTENS Are Back","th":168,"oh":540,"id":"qCR7qXt7VX38iM:","itg":false,"cb":15}
So, for the URL, we need to extract the value of the key "ou".
Example code:
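import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is only a placeholder so that the snippet compiles.
public class GoogleImageUrls {
    public static void main(String[] args) {
        // can only grab the first 100 results
        String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
        String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=kittens&gws_rd=cr";
        List<String> resultUrls = new ArrayList<String>();
        try {
            Document doc = Jsoup.connect(url).userAgent(userAgent).referrer("https://www.google.com/").get();
            // Each div.rg_meta holds the JSON metadata of one image result
            Elements elements = doc.select("div.rg_meta");
            for (Element element : elements) {
                if (element.childNodeSize() > 0) {
                    // The first child node is the JSON text; "ou" is the original image URL
                    JSONObject jsonObject = (JSONObject) new JSONParser().parse(element.childNode(0).toString());
                    resultUrls.add((String) jsonObject.get("ou"));
                }
            }
            System.out.println("number of results: " + resultUrls.size());
            for (String imageUrl : resultUrls) {
                System.out.println(imageUrl);
            }
        } catch (IOException | ParseException e) {
            e.printStackTrace();
        }
    }
}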
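If you just want to try the "ou" extraction step on its own, without a network request or any library, here is a minimal sketch. It is not the JSON.simple approach used above: it swaps in a plain regular expression from the standard library, and it only undoes the \/ escape, so other JSON escapes (such as \uXXXX) are left untouched. The class name OuExtractor and the method extractOu are hypothetical names made up for this illustration, and the sample string is a shortened copy of the JSON object shown earlier.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of jsoup or JSON.simple.
class OuExtractor {
    // Matches "ou":"<value>", allowing backslash-escaped characters inside the value.
    private static final Pattern OU =
            Pattern.compile("\"ou\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the value of the first "ou" key found, with "\/" turned back into "/".
    static Optional<String> extractOu(String json) {
        Matcher m = OU.matcher(json);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Shortened version of the rg_meta JSON object from the answer.
        String sample = "{\"st\":\"Uber\",\"tw\":300,"
                + "\"ou\":\"https:\\/\\/newsroom.uber.com\\/wp-content\\/uploads\\/2015\\/10\\/HQ_uberkittens_blog_960x540_r1v1.jpg\","
                + "\"ow\":960}";
        System.out.println(extractOu(sample).orElse("(no ou key)"));
    }
}
```

For real scraping a proper JSON parser such as JSON.simple is the safer choice, because a regular expression can be fooled by text inside other string values that happens to look like an "ou" key.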