How to get the image source and description from HTML data using Jsoup

Time: 2015-12-10 11:55:24

Tags: java html jsoup

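I am trying to parse an Atom feed with the ROME API and extract its entries. Each entry has a content element that holds the article's image and description. The feed URL is https://news.google.com/news/section?output=atom&ned=in&q=narendra%20modi. Now I want to extract the image and the description from the content section.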


This is one entry from the feed that I am trying to extract the image from:

 <entry>
<id>tag:news.google.com,2005:cluster=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222</id>
<title type="html">'Not Just GST Stuck In Parliament. Matter of Sorrow': PM Narendra Modi - NDTV</title>
<updated>2015-12-10T06:03:54Z</updated>
<link rel="alternate" type="text/html" href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=in&amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52779006372283&amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222" hreflang="en"/>
<content type="html">&lt;table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;">&lt;tr>&lt;td width="80" align="center" valign="top">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;img src="//t3.gstatic.com/images?q=tbn:ANd9GcSNi4SJFo9q9PXKPOjJkiUlfk2GFRzRoBlwK6UsiSQ8np66JDvgQiYTdN4Fknntb7bVjdR-NuM" alt="" border="1" width="80" height="80">&lt;br>&lt;font size="-2">NDTV&lt;/font>&lt;/a>&lt;/font>&lt;/td>&lt;td valign="top" class="j">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;br>&lt;div style="padding-top:0.8em;">&lt;img alt="" height="1" width="1">&lt;/div>&lt;div class="lh">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;b>&amp;#39;Not Just GST Stuck In Parliament. Matter of Sorrow&amp;#39;: PM &lt;b>Narendra Modi&lt;/b>&lt;/b>&lt;/a>&lt;br>&lt;font size="-1">&lt;b>&lt;font color="#6f6f6f">NDTV&lt;/font>&lt;/b>&lt;/font>&lt;br>&lt;font size="-1">With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister &lt;b>Narendra Modi&lt;/b> today said it was a &amp;quot;matter of sorrow&amp;quot; that Parliament was not running. 
&amp;quot;It is not only GST, but many pro-poor steps are stuck in&amp;nbsp;...&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNEVhO7UtISsITzRIFwxTVFwK8BTDQ&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.india.com/news/india/narendra-modis-stern-message-to-congress-democracy-cannot-run-on-whims-of-some-773082/">&lt;b>Narendra Modi&amp;#39;s&lt;/b> stern message to Congress: Democracy cannot run on whims of some&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>India.com&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNGkBqqpn2OhEI6w68lLCIXMDppu-Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.mid-day.com/articles/jagran-forum-catch-pm-narendra-modi-other-leaders-live/16757192">Jagran Forum: Catch PM &lt;b>Narendra Modi&lt;/b>, other leaders live&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Mid-Day&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNHPkB8Wy_-cDqqZrdfcn1cVUKP-Kg&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.oneindia.com/india/democracy-cant-be-restricted-to-elections-only-narendra-modi-1951641.html">Democracy can&amp;#39;t be restricted to elections only, says &lt;b>Narendra Modi&lt;/b>&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Oneindia&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1" class="p">&lt;a 
href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNFhxDKEsImpQqu0GccMt4MCiPydVw&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.abplive.in/india-news/everyone-must-feel-he-or-she-is-working-for-indias-progress-says-narendra-modi-258229">&lt;nobr>ABP Live&lt;/nobr>&lt;/a>&lt;/font>&lt;br>&lt;font class="p" size="-1">&lt;a class="p" href="http://news.google.com/news/more?ncl=dac7xEJd70rfdkM8gcjOwSJn8BK9M&amp;amp;authuser=0&amp;amp;ned=in">&lt;nobr>&lt;b>all 29 news articles&amp;nbsp;&amp;raquo;&lt;/b>&lt;/nobr>&lt;/a>&lt;/font>&lt;/div>&lt;/font>&lt;/td>&lt;/tr>&lt;/table></content>
</entry>

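This is the jsoup code I tried for the image. It returns nothing, and I also don't know how to go about extracting the description: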

Elements img = doc.getElementsByTag("img");
for (Element el : img) {
    System.out.println("Image Found!");
    System.out.println("src attribute is : " + el.attr("src"));
}

Please help me.

1 answer:

Answer 0 (score: 0)

Try this code. Note that the feed is fetched directly with Jsoup.

Document news = Jsoup.connect("http://news.google.com/news/section?output=atom&ned=in&q=narendra%20modi").get();

int i=0;
for (Element entryContent : news.select("entry > content")) {
    System.out.format("\n## ENTRY %d\n", ++i);
    for (Element el : Jsoup.parse(entryContent.text()).select("img[src], tr td.j font[size]:nth-of-type(2)")) {

        String elementTagName = el.tagName();  

        if (elementTagName.equalsIgnoreCase("img")) {
            System.out.println("src attribute is : " + el.attr("src"));
        } else if (elementTagName.equalsIgnoreCase("font")) {
            System.out.println("description is : " + el.text());
        } else {
            System.out.println("Unexpected element >> " + el.html());
        }
    }
}

SAMPLE OUTPUT

## ENTRY 1
src attribute is : //t0.gstatic.com/images?q=tbn:ANd9GcSLee4ulBtCEOMSuDuLHCAjDZwmlaVaXJVdC09133QbK3X1OpZH3s1RBplznEadxqV5memM0dh3
description is : With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister Narendra Modi today said it was a "matter of sorrow" that Parliament was not running. "It is not only GST, but many pro-poor steps are stuck in ...

## ENTRY 2
src attribute is : //t1.gstatic.com/images?q=tbn:ANd9GcQdJPtLOBi9F2Ktov11_x5kqHC4inID47xKD3we_ZC5rHP1Lps96sYHs_N0pBO9WkDj5KKuEa8
description is : Prime Minister Narendra Modi topped the charts of Facebook under the most-viewed

(...)
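
The src values in the output above are protocol-relative (they start with `//`), so they may need a scheme before they can be downloaded. A minimal sketch using only the JDK, assuming the feed host as the base URL; the class name `SrcResolver`, the method `resolve`, and the sample src value are made up for illustration:

```java
import java.net.URI;

public class SrcResolver {

    // Resolves an img src against the URL of the document it came from.
    // For a protocol-relative src only the scheme is taken from the base.
    static String resolve(String baseUrl, String src) {
        return URI.create(baseUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        // Hypothetical src value, shaped like the ones in the sample output.
        String src = "//t0.gstatic.com/images?q=tbn:EXAMPLE";
        System.out.println(resolve("https://news.google.com/", src));
    }
}
```

`URI.create` throws `IllegalArgumentException` for strings that are not valid URIs, so real feed data may need a try/catch. With Jsoup itself, passing a base URI to `Jsoup.parse(html, baseUri)` and reading `el.absUrl("src")` should give a similar result.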

Tested on Jsoup 1.8.3
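
A likely reason the code in the question finds no img elements, assuming `doc` was built from the feed itself: the body of `<content type="html">` is escaped HTML, so in the parsed feed it is plain text with no child elements. That would be why the code above calls `Jsoup.parse(entryContent.text())` before selecting. The sketch below shows the effect with the JDK's own XML parser; the class name and the shortened `ENTRY` string are hypothetical stand-ins for a real feed entry:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class EscapedContentDemo {

    // Hypothetical, heavily shortened stand-in for one <entry> of the feed.
    static final String ENTRY =
        "<entry><content type=\"html\">"
        + "&lt;table>&lt;tr>&lt;td>&lt;img src=\"//example.invalid/thumb.jpg\">"
        + "&lt;/td>&lt;/tr>&lt;/table>"
        + "</content></entry>";

    // Parses the entry XML and returns its first <content> element.
    static Element contentElement(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName("content").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse entry XML", e);
        }
    }

    // Number of real <img> elements below <content> in the XML tree.
    static int imgElementCount(String xml) {
        return contentElement(xml).getElementsByTagName("img").getLength();
    }

    // The decoded text of <content>: this is the HTML that still needs parsing.
    static String contentText(String xml) {
        return contentElement(xml).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("img elements in the XML tree: " + imgElementCount(ENTRY));
        System.out.println("decoded content text: " + contentText(ENTRY));
    }
}
```

The element count is zero while the decoded text contains the img tag, so the text has to go through an HTML parser (Jsoup in the answer) in a second step.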