Question

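I am new to web crawling, and I am using HttpWebRequest to scrape data from websites.

So far I have successfully crawled my WordPress site and retrieved data from it. The data is simple user profile data (name, email, AIM ID, and so on).

Now, as an exercise, I want to crawl Wikipedia: I will take the value entered in my text box, search Wikipedia with it, and get the matching titles from the search results.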

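I have the following doubts and difficulties.

First of all, is this even possible? I have heard that Wikipedia has a robots.txt setup that blocks this. I only heard it from a friend, though, so I am not sure.
I am using the same procedure I used before, but I am not getting the desired results.

Thanks!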

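Update: After some explanation and help from @svick, I tried the following code, but I still cannot get any value (see the last line of the code; I am expecting the HTML markup of the search results page).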

string searchUrl = "http://en.wikipedia.org/w/index.php?search=Wikipedia&title=Special%3ASearch";

var postData = new StringBuilder();
postData.Append("search=" + model.Query);
postData.Append("&");
postData.Append("title=Special:Search");

byte[] data2 = Crawler.GetEncodedData(postData.ToString());

var webRequest = (HttpWebRequest)WebRequest.Create(searchUrl);

webRequest.Method = "POST";
webRequest.UserAgent = "Crawling HW (http://yassershaikh.com/contact-me/)";
webRequest.AllowAutoRedirect = false;

ServicePointManager.Expect100Continue = false;

Stream requestStream = webRequest.GetRequestStream();
requestStream.Write(data2, 0, data2.Length);
requestStream.Close();

var responseCsv = (HttpWebResponse)webRequest.GetResponse();
Stream response = responseCsv.GetResponseStream();

// Todo Parsing
var streamReader = new StreamReader(response);
string val = streamReader.ReadToEnd();

// val is empty !! <-- this is my problem !

Here is my GetEncodedData method definition.

// Encodes the POST body as ASCII bytes.
public static byte[] GetEncodedData(string postData)
{
    var encoding = new ASCIIEncoding();
    byte[] data = encoding.GetBytes(postData);
    return data;
}

Please help me.

Answer 1

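You probably don't need to use HttpWebRequest. Using WebClient (or HttpClient, if you are on .NET 4.5) will be easier for you.
robots.txt doesn't actually block anything. A client that doesn't honor it (and .NET doesn't) can access anything.
Wikipedia blocks requests that don't set a User-Agent header. You should also use an informative User-Agent string with your contact information.
A better way to access Wikipedia is to use its API rather than scraping. That way you get a response designed to be read by a custom application, formatted as XML or JSON. There are also dumps containing all information from Wikipedia available for download.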
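
As a rough illustration of the WebClient, User-Agent, and API points above, here is a minimal sketch. It assumes the MediaWiki search endpoint (api.php with action=query and list=search); the class and method names are hypothetical, and parsing the returned JSON is left out.

```csharp
using System;
using System.Net;
using System.Text;

public static class WikipediaApiSketch
{
    // Hypothetical helper: builds a search URL for the MediaWiki API.
    // The parameter names (action, list, srsearch, format) are assumed
    // from the public MediaWiki API, not taken from the original post.
    public static string BuildApiSearchUrl(string query)
    {
        return "https://en.wikipedia.org/w/api.php"
             + "?action=query&list=search&format=json"
             + "&srsearch=" + Uri.EscapeDataString(query);
    }

    // Downloads the raw JSON response as a string.
    public static string Search(string query, string userAgent)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            // Wikipedia blocks requests that do not send a User-Agent header.
            client.Headers[HttpRequestHeader.UserAgent] = userAgent;
            return client.DownloadString(BuildApiSearchUrl(query));
        }
    }
}
```

A call might look like WikipediaApiSketch.Search(model.Query, "Crawling HW (http://yassershaikh.com/contact-me/)"), reusing the User-Agent string from the question.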

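Edit: The problem with your newly posted code is that your query returns a 302 Moved Temporarily response redirecting to the article you searched for (if it exists). Either remove the line that disables AllowAutoRedirect, or add &fulltext=Search to your query, which means you won't be redirected.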
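
The two fixes can be combined in a short sketch. Note the assumptions: it sends a plain GET with the parameters in the query string instead of the POST used in the question, the class and method names are hypothetical, and AllowAutoRedirect is simply left at its default of true.

```csharp
using System;
using System.IO;
using System.Net;

public static class WikipediaSearchPageSketch
{
    // Hypothetical helper: builds the Special:Search URL with
    // fulltext=Search so the results page is returned instead of a
    // redirect to a matching article.
    public static string BuildSearchPageUrl(string query)
    {
        return "http://en.wikipedia.org/w/index.php"
             + "?title=Special%3ASearch&fulltext=Search"
             + "&search=" + Uri.EscapeDataString(query);
    }

    // Fetches the HTML of the search results page.
    public static string FetchSearchPageHtml(string query, string userAgent)
    {
        var request = (HttpWebRequest)WebRequest.Create(BuildSearchPageUrl(query));
        request.Method = "GET";
        request.UserAgent = userAgent;
        // AllowAutoRedirect is not set to false here, so any redirect
        // the server issues is followed automatically.
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```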

Crawling Wikipedia using ASP.NET HttpWebRequest
