For example, I want to extract the text from this article's HTML:
<div class="description">
<div style="clear: none;" class="post-fb-like">
<fb:like class=" fb_edge_widget_with_comment fb_iframe_widget" href="http://mashable.com/2011/08/07/3-handy-mobile-apps/" send="true" width="625" height="61"><span><iframe src="http://www.facebook.com/plugins/like.php?api_key=116628718381794&channel_url=http%3A%2F%2Fstatic.ak.fbcdn.net%2Fconnect%2Fxd_proxy.php%3Fversion%3D3%23cb%3Df138585052991e8%26origin%3Dhttp%253A%252F%252Fmashable.com%252Ff15a8eb75cc2b58%26relation%3Dparent.parent%26transport%3Dpostmessage&href=http%3A%2F%2Fmashable.com%2F2011%2F08%2F07%2F3-handy-mobile-apps%2F&layout=standard&locale=en_US&node_type=link&sdk=joey&send=true&show_faces=true&width=625" class="fb_ltr" title="Like this content on Facebook." style="border: medium none; overflow: hidden; height: 29px; width: 625px;" name="f2d40595a65cf36" id="f24fece5e565ec4" scrolling="no"></iframe></span></fb:like>
</div>
<p><img src="http://ec.mashable.com/wp-content/uploads/2009/01/bizspark2.gif" alt="" align="left"><em>The <a href="http://mashable.com/tag/bizspark">Spark of Genius Series</a> highlights a unique feature of startups and is made possible by <a rel="nofollow" href="http://www.microsoftstartupzone.com/BizSpark/Pages/At_a_Glance.aspx?WT.mc_id=MSZ_Mashable_posts" target="_blank">Microsoft BizSpark</a>. If you would like to have your startup considered for inclusion, please see the details <a href="http://mashable.com/bizspark/">here</a>.</em></p>
<p><img src="http://5.mshcdn.com/wp-content/uploads/2011/08/mobile-devices.jpg" alt="" title="mobile devices" class="alignright" height="141" width="225">Each <a href="http://mashable.com/follow/topics/startup-weekend-roundup">weekend</a>, <em>Mashable</em> hand-picks startups we think are building interesting, unique or niche products. </p>
<p>This week, we’ve rounded up startups making mobile applications that bridge the physical and digital worlds for improved communication and enhanced experiences. </p>
<p>TransFire breaks down global communication barriers with its instant and automatic translation capabilities, while Babbleville facilitates neighbor-to-neighbor communication around events or topics. And, Picdish uses time and place to bring friends together over shared mobile food experiences.</p>
<hr>
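I also have another HTML page that I want to extract text from, but it is formatted differently: http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2
No matter which article URL is given, how can I create a selector that extracts the text?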
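Answer 0 (score: 1)
No matter which article URL is given, how can I create a selector that extracts the text?
You can't. Every website has its own HTML structure. Open the page in a web browser yourself, right-click, choose View Source, and look at the markup. You will need to create a separate selector for each website.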
For your first example, assuming that it is the entire HTML, the text is inside <p> tags. Then you can use:
Document html = Jsoup.parse(yourHtmlString);
Elements paragraphs = html.select("p");
String text = paragraphs.text();
// ...
For the CNN page, judging by its HTML source, you want all <p> elements inside <div class="cnn_strycntntlft">, so this selector should do it:
Document document = Jsoup.connect("http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2").get();
Elements paragraphs = document.select(".cnn_strycntntlft p");
String text = paragraphs.text();
// ...
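Since each site needs its own selector, one way to organize this is a lookup table keyed by host name. The following is only a minimal sketch using the standard library: the class name `SiteSelectors` and the method `selectorFor` are made up for illustration, and the `.description p` entry assumes the Mashable article body is wrapped in the `<div class="description">` shown in the question. The returned string is what you would pass to Jsoup's `document.select(...)`.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;

class SiteSelectors {
    // Hypothetical registry: host name -> CSS selector for the article paragraphs.
    // Each entry has to be found by inspecting that site's HTML source by hand.
    private static final Map<String, String> SELECTORS = Map.of(
        "www.cnn.com", ".cnn_strycntntlft p",
        "mashable.com", ".description p"   // assumption based on the sample HTML above
    );

    // Returns the selector registered for the URL's host, or empty if the site is unknown.
    static Optional<String> selectorFor(String articleUrl) {
        String host = URI.create(articleUrl).getHost();
        return Optional.ofNullable(host).map(SELECTORS::get);
    }

    public static void main(String[] args) {
        String url = "http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2";
        System.out.println(selectorFor(url).orElse("no selector registered for this site"));
    }
}
```

An unknown host yields an empty `Optional`, which makes it explicit that a new site needs a new selector rather than silently falling back to a guess.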
By the way, it would be easier to just use their RSS feed instead of parsing the whole HTML. Many news sites offer RSS feeds for exactly this purpose.
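As a rough illustration of the RSS route, the sketch below reads one field from every `<item>` in an RSS 2.0 document using only the JDK's built-in XML parser. The class name `RssItems`, the method `itemFields`, and the sample feed content are all invented for this example; a real feed would be downloaded from the site's feed URL first, and its `<description>` may contain only a summary rather than the full article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RssItems {
    // Collects the text of the given child element (e.g. "title" or "description")
    // from every <item> in the feed. Items lacking that child are skipped.
    static List<String> itemFields(String rssXml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                NodeList field = ((Element) items.item(i)).getElementsByTagName(tagName);
                if (field.getLength() > 0) {
                    values.add(field.item(0).getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RSS document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed, standing in for a downloaded one.
        String sample = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
            + "<item><title>First story</title><description>Summary one.</description></item>"
            + "<item><title>Second story</title><description>Summary two.</description></item>"
            + "</channel></rss>";
        System.out.println(itemFields(sample, "title"));
        System.out.println(itemFields(sample, "description"));
    }
}
```

Because every RSS 2.0 feed uses the same `<item>` layout, this one routine works across sites, which is exactly what the per-site HTML selectors cannot do.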