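I am developing an RSS feed that pulls data from an Amazon RSS feed of books. I am using C# with the .NET Compact Framework 3.5. I can get the book's title, publication date, and so on from nodes in the RSS feed. However, the book's price is embedded in a whole heap of HTML inside the description node. How can I extract just the price rather than the entire load of HTML?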
if (nodeChannel.ChildNodes[i].Name == "item")
{
    nodeItem = nodeChannel.ChildNodes[i];
    row = new ListViewItem();
    row.Text = nodeItem["title"].InnerText;
    row.SubItems.Add(nodeItem["description"].InnerText);
    listBooks.Items.Add(row);
}
Example of the price in the middle of the description node:
<description><![CDATA[ <div class="hreview" style="clear:both;"> <div class="item"> <div style="float:left;" class="tgRssImage"><a class="url" href="https://rads.stackoverflow.com/amzn/click/com/B0013FDM7E" rel="nofollow noreferrer"><img src="http://ecx.images-amazon.com/images/I/51MvRlzFlpL._SL160_SS160_.jpg" width="160" alt="I Am Legend (Widescreen Single-Disc Edition)" class="photo" height="160" border="0" /></a></div> <span class="tgRssTitle fn summary">I Am Legend (Widescreen Single-Disc Edition) (<span class="tgRssBinding">DVD</span>)<br />By <span class="tgRssAuthor">Will Smith</span><br /></span> </div> <div class="description"> <br /> <span style="display: block;" class="tgRssPriceBlock"><span class="tgProductPriceLine"><a href="https://rads.stackoverflow.com/amzn/click/com/B0013FDM7E" rel="nofollow noreferrer">Buy new</a>: <span class="tgProductPrice">$5.49</span></span><br /><span class="tgProductUsedPrice"><a href="http://www.amazon.com/gp/offer-listing/B0013FDM7E/ref=tag_rso_rs_eofr_used" id="tag_rso_rs_eofr_used">285 used and new</a> from <span class="tgProductPrice">$1.00</span></span><br /></span> <span class="tgRssReviews">Customer Rating: <img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-3-5._V192240731_.gif" width="64" alt="3.6" align="absbottom" height="12" border="0" /><br /></span> <br /> <span class="tgRssProductTag"></span> <span class="tgRssAllTags">Customer tags: <a href="http://www.amazon.com/tag/science%20fiction/ref=tag_rss_rs_itdp_item_at">science fiction</a>(92), <a href="http://www.amazon.com/tag/will%20smith/ref=tag_rss_rs_itdp_item_at">will smith</a>(79), <a href="http://www.amazon.com/tag/horror/ref=tag_rss_rs_itdp_item_at">horror</a>(51), <a href="http://www.amazon.com/tag/action/ref=tag_rss_rs_itdp_item_at">action</a>(43), <a href="http://www.amazon.com/tag/adventure/ref=tag_rss_rs_itdp_item_at">adventure</a>(34), <a 
href="http://www.amazon.com/tag/fantasy/ref=tag_rss_rs_itdp_item_at">fantasy</a>(33), <a href="http://www.amazon.com/tag/dvd/ref=tag_rss_rs_itdp_item_at">dvd</a>(30), <a href="http://www.amazon.com/tag/movie/ref=tag_rss_rs_itdp_item_at">movie</a>(20), <a href="http://www.amazon.com/tag/zombies/ref=tag_rss_rs_itdp_item_at">zombies</a>(14), <a href="http://www.amazon.com/tag/i%20am%20legend/ref=tag_rss_rs_itdp_item_at">i am legend</a>(6), <a href="http://www.amazon.com/tag/bad%20sci-fi/ref=tag_rss_rs_itdp_item_at">bad sci-fi</a>(4), <a href="http://www.amazon.com/tag/mutants/ref=tag_rss_rs_itdp_item_at">mutants</a>(4)<br /></span> </div></div>]]></description>
The $5.49 is right in the middle of that mess.
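Answer 0 (score: 1)
This might be a silly idea, but how about doing a string search for class="tgProductPrice"> and then extracting the following characters until you hit the closing </span> tag?
You wouldn't need to load any HTML; you could do this directly on the description text.
Would that work for you?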
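The string search described above could be sketched roughly as follows. This is only an illustration, not tested on the Compact Framework: the class name `PriceExtractor` and the method `ExtractFirstPrice` are hypothetical, and the marker string is taken from the sample description in the question, so it will break if Amazon changes the markup.

```csharp
using System;

static class PriceExtractor
{
    // Marker copied from the sample description; assumed to stay stable.
    private const string Marker = "class=\"tgProductPrice\">";
    private const string EndTag = "</span>";

    // Returns the text of the first tgProductPrice span, or null if not found.
    public static string ExtractFirstPrice(string description)
    {
        if (description == null) return null;

        int start = description.IndexOf(Marker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += Marker.Length;

        int end = description.IndexOf(EndTag, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return description.Substring(start, end - start).Trim();
    }
}
```

In the loop from the question it might be used as `row.SubItems.Add(PriceExtractor.ExtractFirstPrice(nodeItem["description"].InnerText) ?? "n/a");`. In the sample feed the first match is the "Buy new" price; the used price sits in a second span with the same class.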
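Answer 1 (score: 1)
That description looks pretty awful. If there is no way for you to get a different version of that RSS feed, I think the only solution is to parse the HTML you have in the description.
For this you can use the HTML Agility Pack (I haven't used it, but it is the commonly recommended solution for parsing HTML from .NET), or use a regular expression or text search to find that tag and extract the price (this feels somewhat clumsy to me, and if the RSS changes it could require a lot of rework).
Edit: I have combined string searching with regular expressions before and it is a maintenance nightmare, but given your situation, and with only one value to extract, it might be fine.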
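As a rough sketch of the regular-expression option, assuming the `tgProductPrice` span from the sample description keeps its current shape: the class name `PriceRegex` and the method `ExtractAllPrices` are made up for illustration, and the pattern is deliberately narrow rather than a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PriceRegex
{
    // Captures the text between the tgProductPrice opening tag and its </span>.
    private static readonly Regex PricePattern =
        new Regex("class=\"tgProductPrice\">\\s*([^<]+?)\\s*</span>");

    // Returns every price found, in document order (empty list if none).
    public static List<string> ExtractAllPrices(string description)
    {
        var prices = new List<string>();
        if (string.IsNullOrEmpty(description)) return prices;

        foreach (Match m in PricePattern.Matches(description))
        {
            prices.Add(m.Groups[1].Value);
        }
        return prices;
    }
}
```

For the sample item this would yield the "Buy new" price first and the lowest used price second. Keeping the pattern in one place limits the rework if the feed markup changes.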
Answer 2 (score: 0)
using CsQuery; // get CsQuery from the NuGet packages

string path = textBox1.Text;
var dom = CQ.CreateFromUrl(path);
// priceblock_ourprice is the id of the span where the price is written
string price = dom.Select("#priceblock_ourprice").Text();
label1.Text = price;