[C#] Getting website source code (404 error)

Time: 2014-11-24 14:14:05

Tags: c# web httpwebrequest

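I have to fetch the source code of about 1000 websites for a school project. I am using HttpWebRequest inside a for loop, but more than half of the sites in my list return a 404 error, as if the site could not be found. When I open the same sites in Chrome, Firefox, or Internet Explorer, everything works fine.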

Here is my code for getting the source:

// Requires: using System.IO; using System.Net; using System.Text;
public string getSource(string url)
{
    string data = null; // stays null when the status is not 200 OK
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

    // Note: GetResponse throws a WebException for 4xx/5xx responses such as 404.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode == HttpStatusCode.OK)
        {
            Stream receiveStream = response.GetResponseStream();
            StreamReader readStream;

            if (string.IsNullOrEmpty(response.CharacterSet))
            {
                // No charset reported: fall back to the StreamReader default (UTF-8).
                readStream = new StreamReader(receiveStream);
            }
            else
            {
                readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
            }

            using (readStream)
            {
                data = readStream.ReadToEnd();
            }
        }
    }
    return data;
}

Could it be failing because of the sheer number of websites (about 1000)?

1 answer:

Answer 0: (score: 0)

You may have to set the user agent to that of a known browser, because many websites reject requests from unknown clients. Try this before calling request.GetResponse:
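var agent = "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)";
// User-Agent is a restricted header on HttpWebRequest in .NET Framework:
// Headers.Add("user-agent", ...) throws an ArgumentException there, so use the property.
request.UserAgent = agent;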
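
To make the suggestion concrete, here is a minimal sketch of how the pieces might fit together, using the same HttpWebRequest API as the question. The names `SourceFetcher`, `GetSource`, and `ResolveEncoding` are hypothetical, and the fallback to UTF-8 for a missing or unknown charset is an assumption rather than something the question requires. The sketch sets the user agent through the `UserAgent` property and catches the `WebException` that `GetResponse` throws on a 404, so a loop over many URLs can record the status and continue.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class SourceFetcher
{
    // The user-agent string from the answer above; any browser-like value may work.
    private const string BrowserAgent =
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)";

    // Returns the encoding named by the server, or UTF-8 (an assumed fallback)
    // when the name is missing or not recognized on this platform.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers wrap the charset in quotes, so strip them first.
            return Encoding.GetEncoding(charset.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Returns the page source, or null when the request fails.
    // statusCode is set whenever the server sent any HTTP response at all.
    public static string GetSource(string url, out HttpStatusCode? statusCode)
    {
        statusCode = null;
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = BrowserAgent;

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var stream = response.GetResponseStream())
            using (var reader = new StreamReader(stream, ResolveEncoding(response.CharacterSet)))
            {
                statusCode = response.StatusCode;
                return reader.ReadToEnd();
            }
        }
        catch (WebException ex)
        {
            // GetResponse throws for 4xx/5xx; the failed response, if there is one,
            // is attached to the exception. It is null for DNS or connection errors.
            var failed = ex.Response as HttpWebResponse;
            if (failed != null)
            {
                statusCode = failed.StatusCode;
                failed.Close();
            }
            return null;
        }
    }
}
```

A caller could then write `HttpStatusCode? status; string html = SourceFetcher.GetSource(url, out status);` inside the loop and log every URL where `html` is null together with `status`. That makes it easier to see whether the remaining failures are real 404s or sites that still reject the request for another reason.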