How to extract information from HTML in C#?

Asked: 2012-07-03 02:39:24

Tags: c# html syndication syndication-feed

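Can someone show me how to extract information from HTML in C#? I am using the WinRT class library in C#.

I want to extract the main content and images from http://lifehacker.com/5923026/remains-of-the-day-google-image-search-gets-knowledge-graph-integration.

Here is part of the site's code: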

<html xmlns="http://www.w3.org/1999/xhtml" class="feature_chompcommentimages feature_s3upload feature_switch feature_powwowtest" xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
  **<title>Remains of the Day: Google Image Search Gets Knowledge Graph Integration</title>**
          <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <meta http-equiv="content-language" content="en" />
  <meta http-equiv="refresh" content="86400" />
  <meta name="robots" content="all" />
                      <meta name="keywords" content="For What It&#039;s Worth, remainders, in brief, Lifehacker" />
                  <meta property="fb:page_id" content="7568536355" />
                              <meta name="title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      **<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />**
                      <link rel="image_src" href="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/original.jpg" />
          <meta property="og:image" content="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/xlarge.jpg" />
                  <meta property="og:site_name" content="Lifehacker"/>
      <meta property="og:title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      <meta property="og:description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS." />
      <meta property="og:type" content="article" />

I can use SyndicationFeed.Title.Text (using Windows.Web.Syndication;) to extract "Remains of the Day: Google Image Search Gets Knowledge Graph Integration".

Please help me extract

<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />

I also need to extract the main content from
<div id="container"> <script type="text/javascript">

<!-- %JUMP:More &raquo;% --><\/p>\n<ul>\n<li><a href=\"http:\/\/insidesearch.blogspot.com\/2012\/07\/find-smarter-more-comprehensive-search.html\">Find Smarter, More Comprehensive Search by Image Results<\/a> <i>Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identity any photo that you upload or link to and provide more information about the subject.<\/i> [Google Blog]<\/li>\n<li>

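Content: "Find Smarter, More Comprehensive Search by Image Results" "Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identity any photo that you upload or link to and provide more information about the subject. [Google Blog]"

Thanks a lot!!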

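[July 4, 2012]
Sorry, to clarify: I am trying to extract text (as a string) and images (as a link or a BitmapImage) from the HTML, either by parsing the HTML directly or by first converting it to XML and parsing that.

I am using HtmlAgilityPack from htmlagilitypack.codeplex.com and the tutorial at 4guysfromrolla.com/articles/011211-1.aspx. I am still wondering whether there is a better solution for Metro-style apps, because HtmlAgilityPack lacks some support for them. For example, it has methods to convert HTML to XML, but WinRT no longer supports .NET's XmlTextReader.

Thanks again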

1 Answer:

Answer 0: (score: 0)

Jerry, I suggest you use an RSS library instead of parsing this XML. Take a look at RssToolkit.
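If the values have to come from the page itself rather than from a feed, a minimal sketch with HtmlAgilityPack (the library already used in the question) could read the `<meta>` tags roughly as follows. This is only an illustration under assumptions: `MetaExtractor` and `GetMetaContent` are hypothetical names, the HTML is a shortened copy of the `<head>` from the question, and it is written as a plain .NET console program, so in a Metro-style app the HTML string would have to be downloaded first and the output sent somewhere other than the console. It uses `Descendants` with LINQ rather than XPath.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is referenced

public static class MetaExtractor
{
    // Hypothetical helper: returns the content attribute of the first <meta>
    // whose attribute `attrName` equals `attrValue`, or null if none exists.
    public static string GetMetaContent(HtmlDocument doc, string attrName, string attrValue)
    {
        HtmlNode node = doc.DocumentNode
            .Descendants("meta")
            .FirstOrDefault(m => string.Equals(
                m.GetAttributeValue(attrName, ""), attrValue,
                StringComparison.OrdinalIgnoreCase));

        if (node == null) return null;

        // Decode HTML entities (e.g. &#039;) and trim surrounding whitespace,
        // such as the trailing space in the page's description.
        return HtmlEntity.DeEntitize(node.GetAttributeValue("content", "")).Trim();
    }

    public static void Main()
    {
        // Shortened copy of the <head> shown in the question.
        string html = @"<html><head>
<title>Remains of the Day</title>
<meta name=""description"" content=""Google updates Image Search with Knowledge Graph integration. "" />
<meta property=""og:image"" content=""http://img.gawkerassets.com/img/17rm77tdcfd31jpg/xlarge.jpg"" />
</head></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Description text and the URL of the article image.
        Console.WriteLine(GetMetaContent(doc, "name", "description"));
        Console.WriteLine(GetMetaContent(doc, "property", "og:image"));
    }
}
```

The same helper covers both the `name="description"` tag and the `property="og:image"` tag, because the attribute to match is passed in as a parameter. The article body inside the `<script>` block is a JavaScript-escaped string, so it would need to be unescaped before it could be parsed the same way.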