Question

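I'm using C# in a program to list all of the course resources for a MOOC (e.g. Coursera). I don't want to download the content, just get a list of all the resources available for the course (e.g. PDFs, videos, text files, sample files).

My problem is parsing the HTML source (I'm currently using HtmlAgilityPack) without downloading all of the content.

For example, if you go to this intro video for a banking course on Coursera and view the source (F12 in Chrome Developer Tools), you can see the page source. I can stop the auto-playing video download and still see the source.

How can I get the source in C# without downloading all of the content? I've looked at HttpWebRequest headers (problem: the request times out) and DownloadDataAsync with Cancel (problem: the Completed Result object is invalid once the async request is cancelled). I've also tried the various load methods in HtmlAgilityPack, with no success.

Timeout: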

        HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
        postRequest.Timeout = TIMEOUT * 1000000; //Really long
        postRequest.Referer = "https://www.coursera.org"; 

        if (headers != null)
        {
            // headers here
        }

        //Deal with cookies
        if (cookie != null)
        { cookieJar.Add(cookie); }

        postRequest.CookieContainer = cookieJar;
        postRequest.Method = "GET";
        postRequest.AllowAutoRedirect = allowRedirect;
        postRequest.ServicePoint.Expect100Continue = true;
        HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();

Any hints on how to proceed?

Answer 1

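There are at least two ways to do what you're asking. The first is to use a range GET, that is, to specify the range of the file that you want to read. You do that by calling AddRange on the HttpWebRequest. So if you want, say, the first 10 kilobytes of the file, you'd write: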

request.AddRange(-10240);

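Read carefully what the documentation says about the meaning of that parameter. If it's negative, it specifies the ending point of the range. There are also other overloads of AddRange that you might be interested in.

Not all servers support range requests, though. If that doesn't work, you'll have to do it another way.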
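Because the sign convention of the single-argument overload is easy to misread, here is a minimal sketch that uses the two-argument AddRange overload instead, which states both ends of the range explicitly. It also checks whether the server honored the range. The class name is made up for illustration, the URL is the one from the sample further down, and the fallback shown is only one possible choice.

```csharp
using System;
using System.Net;

static class RangeSketch
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("https://www.coursera.org/course/money");
        request.Method = "GET";
        // Explicit, inclusive start and end offsets: bytes 0 through 10239,
        // i.e. the first 10 kilobytes.
        request.AddRange(0, 10239);

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // 206 Partial Content means the server honored the range.
            // 200 OK means it ignored the header and is sending everything.
            if (response.StatusCode == HttpStatusCode.PartialContent)
            {
                Console.WriteLine("Server honored the range request");
            }
            else
            {
                Console.WriteLine("Range ignored; read only what you need and stop early");
                // Abort so that disposing the response does not wait on the unread body.
                request.Abort();
            }
        }
    }
}
```

A server that ignores the Range header is still a valid HTTP server, so the 200 branch is the case to plan for.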

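What you can do is call GetResponse and then start reading data. Once you've read as much data as you want, you can stop reading and close the stream. I modified your sample slightly to show what I mean.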

string url = "https://www.coursera.org/course/money";
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = true; //allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse) postRequest.GetResponse();
int maxBytes = 1024*1024;
int totalBytesRead = 0;
var buffer = new byte[maxBytes];
using (var s = postResponse.GetResponseStream())
{
    int bytesRead;
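    // read up to `maxBytes` bytes from the response, appending each chunk
    // to `buffer` and never requesting more than the space that is left
    while (totalBytesRead < maxBytes &&
           (bytesRead = s.Read(buffer, totalBytesRead, maxBytes - totalBytesRead)) != 0)
    {
        // The new bytes are at buffer[totalBytesRead .. totalBytesRead + bytesRead).
        // Here you can also write them to a file.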
        Console.WriteLine("{0:N0} bytes read", bytesRead);
        totalBytesRead += bytesRead;
    }
}
Console.WriteLine("total bytes read = {0:N0}", totalBytesRead);

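That said, I ran this sample and it downloaded about 6 kilobytes and stopped. I don't know why you're having trouble with timeouts or with too much data.

Note that sometimes trying to close the stream before the entire response has been read will cause the program to hang. I'm not sure why that happens at all, and I can't explain why it happens only sometimes. But you can work around it by calling Abort on the request before closing the stream. That is: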

using (var s = postResponse.GetResponseStream())
{
    // do stuff here
    // abort the request before continuing
    postRequest.Abort();
}
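
Putting the two pieces together, the following is a rough sketch of a helper that reads at most a fixed number of bytes, aborts the request, and returns the text so that it can be handed to an HTML parser. The names PagePrefix, ReadAtMost and Fetch are hypothetical, and decoding as UTF-8 is an assumption; a real page may declare a different charset.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PagePrefix
{
    // Reads at most maxBytes from the stream, stopping early at end of stream.
    public static byte[] ReadAtMost(Stream stream, int maxBytes)
    {
        var buffer = new byte[maxBytes];
        int total = 0;
        int read;
        while (total < maxBytes &&
               (read = stream.Read(buffer, total, maxBytes - total)) != 0)
        {
            total += read;
        }
        Array.Resize(ref buffer, total); // trim to the bytes actually read
        return buffer;
    }

    // Hypothetical helper: fetches only the first maxBytes of a page as text.
    public static string Fetch(string url, int maxBytes)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            byte[] data = ReadAtMost(stream, maxBytes);
            // Abort before the stream is closed so that disposal does not
            // wait on the unread remainder of the response.
            request.Abort();
            // Assumption: the page is UTF-8. A character cut off at the
            // byte limit is decoded as a replacement character.
            return Encoding.UTF8.GetString(data);
        }
    }

    static void Main()
    {
        string html = Fetch("https://www.coursera.org/course/money", 64 * 1024);
        Console.WriteLine("{0:N0} characters fetched", html.Length);
    }
}
```

The returned string may end in the middle of a tag, so whatever parses it needs to tolerate truncated markup.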

How do I cancel a large file download but still get the page source in C#?
