I am trying to parse this HTML using jsoup.
My code is:
doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();
Elements items = doc.select("item");
Log.d(TAG, "Items size : " + items.size());
for (Element item : items) {
    Log.d(TAG, "in for loop of items");
    Element titleElement = item.select("title").first();
    mTitle = titleElement.text().toString();
    Log.d(TAG, "title is : " + mTitle);
    Element linkElement = item.select("link").first();
    mLink = linkElement.text().toString();
    Log.d(TAG, "link is : " + mLink);
    Element descElement = item.select("description").first();
    mDesc = descElement.text().toString();
    Log.d(TAG, "description is : " + mDesc);
}
I get the following output:
in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is :
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.<div class="feedflare"> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo" border="0"></img></a> </div><img src="http://feeds.feedburner.com/~r/reuters/audio/newsmakerus/rss/mp3/~4/NX3AY96GfGk" height="1" width="1"/>
But I want the output to be:
in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/KDcQe4gF-3U/62828262.mp3
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.
What should I change in my code?
How can I achieve this? Please help me!!
Thanks in advance!!
Answer 0 (score: 0)
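There are two problems with the RSS content you fetched:
1. The link text is not inside the <link/> tag but outside it.
2. The description contains escaped HTML.
Please find the modified code below.
I also noticed that viewing the URL in a browser shows clean HTML content, from which the required fields are easy to extract when parsed. You can get that content by setting a browser userAgent on the Jsoup connection, but it is up to you to decide how to fetch the content.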
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// timeout(0) disables the timeout (jsoup treats zero as an infinite timeout).
Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
System.out.println(doc.html());
System.out.println("================================");
Elements items = doc.select("item");
for (Element item : items) {
    Element titleElement = item.select("title").first();
    String mTitle = titleElement.text();
    System.out.println("title is : " + mTitle);
    /*
     * The link in the rss is as follows
     * <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
     * so the URL does not fall inside the <link> element; it is a text node
     * directly under <item>, which ownText() returns.
     */
    String mLink = item.ownText();
    System.out.println("link is : " + mLink);
    Element descElement = item.select("description").first();
    /*
     * Unescape the html content, parse it to a doc, and then fetch only the
     * text, leaving behind all the html tags in the content.
     * "/" is a dummy baseURI, as we don't care about resolving the links
     * within the parsed content.
     */
    String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
    System.out.println("description is : " + mDesc);
}
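A note on why the link ends up outside its tag: in HTML, <link> is a void element that cannot have content, so an HTML parser closes it immediately and the URL becomes loose text under <item>. Treating the feed as XML avoids this. The sketch below is not the code from this answer; it swaps in the JDK's built-in DOM parser to show the idea on an inline sample. The class name RssLinkSketch, the method itemLinks, the SAMPLE_FEED string and the example.com URLs are all made up for illustration. jsoup also has an XML mode (Parser.xmlParser()) that should behave similarly, though it is not exercised here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssLinkSketch {
    // Hypothetical, trimmed-down stand-in for the real feed (assumption).
    static final String SAMPLE_FEED =
          "<rss version=\"2.0\"><channel>"
        + "<title>Sample channel</title>"
        + "<link>http://example.com/</link>"
        + "<item>"
        + "<title>First item</title>"
        + "<link>http://example.com/audio/1.mp3</link>"
        + "<description>Plain summary.&lt;div class=\"feedflare\"&gt;&lt;/div&gt;</description>"
        + "</item>"
        + "</channel></rss>";

    // Parses the feed as XML and returns the text of the first <link> in each <item>.
    static List<String> itemLinks(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList linkNodes = item.getElementsByTagName("link");
            if (linkNodes.getLength() > 0) {
                // With an XML parser the URL is the content of <link>, not a sibling text node.
                links.add(linkNodes.item(0).getTextContent().trim());
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        for (String link : itemLinks(SAMPLE_FEED)) {
            System.out.println("link is : " + link);
        }
    }
}
```

The description would still carry escaped HTML after XML parsing, so the unescape-and-strip step from the answer's code remains necessary for that field.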