Getting data from the page a URL points to

Question

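Given a URL, I want to capture the title of the page it points to, plus some other information, such as a snippet of text from the first paragraph, and perhaps even an image from the page.

Digg.com does this nicely when you submit a URL.

How could something like this be done in .NET with C#?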

Answer 1

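You are looking for the HTML Agility Pack, which can parse malformed HTML documents. You can use its HtmlWeb class to download a web page over HTTP.

You can also use .NET's WebClient class to download text over HTTP. However, it won't help you parse the HTML.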
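
As a rough sketch of how that might look, assuming the HtmlAgilityPack NuGet package is referenced: the helper below loads HTML from a string and pulls out the title, the first paragraph, and the first image URL. The names PagePreview and Extract are made up for illustration and are not part of the library.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class PagePreview
{
    // Returns (title, first paragraph text, first image src); any item may be null.
    public static Tuple<string, string, string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed markup

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        HtmlNode para = doc.DocumentNode.SelectSingleNode("//p");
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@src]");

        return Tuple.Create(
            title == null ? null : HtmlEntity.DeEntitize(title.InnerText).Trim(),
            para == null ? null : HtmlEntity.DeEntitize(para.InnerText).Trim(),
            img == null ? null : img.GetAttributeValue("src", ""));
    }
}
```

For a live page, new HtmlWeb().Load(url) returns the same kind of HtmlDocument, so the three XPath lookups could be applied to it directly.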

Answer 2

You could try something like this:

using System;
using System.IO;
using System.Net;
using System.Text;

namespace WebGet
{
    class progMain
    {
        static void Main(string[] args)
        {
            WebRequest wrq = WebRequest.Create("http://localhost");

            // Dispose the response and its stream when done.
            using (WebResponse wrp = wrq.GetResponse())
            using (Stream stream = wrp.GetResponseStream())
            // ContentLength can be -1 and a single Read may return only part of
            // the body, so read to the end instead of filling a fixed-size buffer.
            // UTF-8 is assumed here; ASCII would garble non-English pages.
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                string responseText = reader.ReadToEnd();
                Console.WriteLine(responseText);
            }
        }
    }
}

Once you have the response contents, you can process them, looking for paragraph or image HTML tags to extract part of the returned data.
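
One way that processing step might look, as a minimal sketch that uses only regular expressions from the base class library: the helper below takes the downloaded HTML as a string and returns the text of the first paragraph and the src of the first image. SnippetFinder and its method names are hypothetical, and regular expressions only cope with reasonably tidy markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class SnippetFinder
{
    // Text of the first <p>...</p> with inner tags stripped, or null if there is none.
    public static string FirstParagraph(string html)
    {
        // \b keeps <pre> and <param> from being mistaken for <p>.
        Match m = Regex.Match(html, @"<p\b[^>]*>(.*?)</p>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success) return null;

        // Drop nested tags such as <b> or <a>, then decode entities like &amp;.
        string inner = Regex.Replace(m.Groups[1].Value, "<[^>]+>", string.Empty);
        return WebUtility.HtmlDecode(inner).Trim();
    }

    // Quoted src value of the first <img> tag, or null if there is none.
    public static string FirstImageSrc(string html)
    {
        Match m = Regex.Match(html, @"<img\b[^>]*?\bsrc\s*=\s*[""']([^""']+)[""']",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The image URL is returned exactly as written in the page, so a relative value would still need to be resolved against the page address, for example with new Uri(baseUri, src).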

Answer 3

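You can use a function like the one below to extract the page title. You would need to modify the regular expression to find the first paragraph of text, but since every page is different, that could be difficult. You could, however, look for the meta description tag and take the value from it.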

using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static string GetWebPageTitle(string url)
{
   // Create a request to the url
   HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;

   // If the request wasn't an HTTP request (like a file), ignore it
   if (request == null) return null;

   // Use the user's credentials
   request.UseDefaultCredentials = true;

   // Obtain a response from the server, if there was an error, return nothing
   HttpWebResponse response = null;
   try { response = request.GetResponse() as HttpWebResponse; }
   catch (WebException) { return null; }

   // Regular expression for an HTML title; [^>]* and the lazy *? keep the
   // match inside the first <title>...</title> pair
   string regex = @"(?<=<title[^>]*>)([\s\S]*?)(?=</title>)";

   // Release the connection once the headers have been checked
   using (response)
   {
      // If the correct HTML header exists for HTML text, continue
      if (new List<string>(response.Headers.AllKeys).Contains("Content-Type"))
         if (response.Headers["Content-Type"].StartsWith("text/html"))
         {
            // Download the page
            WebClient web = new WebClient();
            web.UseDefaultCredentials = true;
            string page = web.DownloadString(url);

            // Extract the title
            Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
            return ex.Match(page).Value.Trim();
         }
   }

   // Not a valid HTML page
   return null;
}
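
The meta description lookup mentioned above could be sketched along these lines. This is only an illustration: MetaReader and GetMetaDescription are hypothetical names, and the code assumes quoted attribute values with no ">" character inside them. It finds each meta tag first and then reads its attributes separately, so the order of name and content does not matter.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class MetaReader
{
    // Content of <meta name="description" ...>, or null when the tag is absent.
    public static string GetMetaDescription(string page)
    {
        foreach (Match tag in Regex.Matches(page, @"<meta\b[^>]*>", RegexOptions.IgnoreCase))
        {
            string name = GetAttribute(tag.Value, "name");
            if (string.Equals(name, "description", StringComparison.OrdinalIgnoreCase))
            {
                string content = GetAttribute(tag.Value, "content") ?? string.Empty;
                return WebUtility.HtmlDecode(content).Trim();
            }
        }
        return null;
    }

    // Value of a single- or double-quoted attribute inside one tag, or null.
    private static string GetAttribute(string tag, string attribute)
    {
        Match m = Regex.Match(tag,
            @"\b" + attribute + @"\s*=\s*(?:""([^""]*)""|'([^']*)')",
            RegexOptions.IgnoreCase);
        if (!m.Success) return null;
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }
}
```

The page string here would be the same text that GetWebPageTitle downloads, so both values could be taken from a single download.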

Answer 4

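You can use Selenium RC (open source, www.seleniumhq.org) to parse data out of the page, among other things. It is a web test automation tool with a C# .NET library.

Selenium has a full API for reading specific items on an HTML page.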
