Problem extracting data from a website in .NET and C#

Time: 2010-06-14 20:13:24

Tags: c# asp.net httpwebresponse streamreader web-scraping

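I wrote a web-scraping program that goes through a list of pages and writes all of the HTML to a file. The problem is that when I pull a block of text, some of the characters get written as ' '. How do I get those characters into my text file? Here is my code: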

string baseUri = String.Format("http://www.rogersmushrooms.com/gallery/loadimage.asp?did={0}&blockName={1}", id.ToString(), name.Trim());

// our third request is for the actual webpage after the login.
HttpWebRequest request =
(HttpWebRequest)WebRequest.Create(baseUri);
request.Method = "GET";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
//get the response object, so that we may get the session cookie.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());

// and read the response
string page = reader.ReadToEnd();

StreamWriter SW;
string filename = string.Format("{0}.txt", id.ToString());
SW = File.AppendText("C:\\Share\\" + filename);

SW.Write(page);

reader.Close();
response.Close();

3 Answers:

Answer 0: (score: 2)

You are saving a web page called loadimage to a text file. Are you sure it really is all text?

Either way, you can save yourself a lot of code by using System.Net.WebClient.DownloadFile().
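A minimal sketch of what that could look like, assuming the same URL pattern and output folder as the question. The values of id and name are placeholders, and the Uri.EscapeDataString call is an addition that is not in the original code.

```csharp
using System;
using System.Net;

class DownloadSketch
{
    static void Main()
    {
        // Hypothetical values standing in for the question's id and name variables.
        int id = 1;
        string name = "example";

        string uri = String.Format(
            "http://www.rogersmushrooms.com/gallery/loadimage.asp?did={0}&blockName={1}",
            id, Uri.EscapeDataString(name.Trim()));

        using (WebClient client = new WebClient())
        {
            // Same User-Agent string the question sends.
            client.Headers[HttpRequestHeader.UserAgent] =
                "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";

            // Copies the response bytes to disk unchanged; no text decoding is involved.
            client.DownloadFile(uri, "C:\\Share\\" + id + ".txt");
        }
    }
}
```

Because DownloadFile writes the raw bytes, the file keeps whatever encoding the server used, which sidesteps the decoding step where characters can get lost. Note that it overwrites an existing file, whereas the question's File.AppendText appends to it.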

Answer 1: (score: 1)

You need to specify the encoding on this line:

StreamReader reader = new StreamReader(response.GetResponseStream());

File.AppendText("C:\\Share\\" + filename); writes UTF-8.
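A StreamReader created without an encoding argument decodes as UTF-8, so a page served in another charset would lose its non-ASCII characters at that point. The sketch below illustrates the effect with an in-memory stream instead of a live response. ReadAll is a hypothetical helper, and the ISO-8859-1 bytes are only an assumption about what such a page might send.

```csharp
using System;
using System.IO;
using System.Text;

class DecodeSketch
{
    // Hypothetical helper: decodes a stream with the named charset,
    // falling back to UTF-8 when the name is missing or unknown.
    static string ReadAll(Stream body, string charset)
    {
        Encoding encoding = Encoding.UTF8;
        if (!String.IsNullOrEmpty(charset))
        {
            try { encoding = Encoding.GetEncoding(charset); }
            catch (ArgumentException) { /* unknown charset name: keep UTF-8 */ }
        }

        using (StreamReader reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
        byte[] latin1 = { 0x63, 0x61, 0x66, 0xE9 };

        string wrong = ReadAll(new MemoryStream(latin1), "utf-8");
        string right = ReadAll(new MemoryStream(latin1), "iso-8859-1");

        // Compare the decoded text with the intended string "café".
        Console.WriteLine(wrong == "caf\u00E9");
        Console.WriteLine(right == "caf\u00E9");
    }
}
```

With the question's code, the arguments would be response.GetResponseStream() and response.CharacterSet, assuming the server reports its charset correctly in the Content-Type header.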

Answer 2: (score: 0)

Specify a Unicode encoding, like this:

new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8)

..and the same for the StreamWriter.
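A rough sketch of both ends with an explicit encoding, assuming the page really is UTF-8. The byte array stands in for response.GetResponseStream() and the temp-file path is a placeholder, so the example runs without network access.

```csharp
using System;
using System.IO;
using System.Text;

class Utf8RoundTripSketch
{
    static void Main()
    {
        // Stand-in for response.GetResponseStream(): UTF-8 bytes of a short HTML fragment.
        byte[] body = Encoding.UTF8.GetBytes("<p>caf\u00E9 \u2013 cr\u00E8me</p>");

        // Hypothetical output path; the question writes to C:\Share\<id>.txt instead.
        string path = Path.Combine(Path.GetTempPath(), "utf8-sketch.txt");

        string page;
        using (StreamReader reader = new StreamReader(new MemoryStream(body), Encoding.UTF8))
        {
            page = reader.ReadToEnd();
        }

        // false = overwrite; pass true to append, as File.AppendText does in the question.
        using (StreamWriter writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.Write(page);
        } // leaving the using block flushes and closes the file

        // Read the file back with the same encoding and compare with the decoded text.
        Console.WriteLine(File.ReadAllText(path, Encoding.UTF8) == page);
    }
}
```

The using blocks also address a separate issue in the question's code: SW is never closed there, so buffered text may never reach the file.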