Question

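I wrote a simple web scraper that scrapes song lyrics for me and writes them to a database. Everything works, but for some reason it replaces some characters with question marks. When I view the data on a simple PHP page, I see a lot of errors in the lyrics: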

I?m = I'm
Let?s = Let's
haven?t = haven't
stuff like that.

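I know the error is in C# and in my code, because I put a breakpoint just before the write to the database and displayed the text in a rich text box. How can I get it to display these characters correctly?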

        public static string getSourceCode(string url)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
            StreamReader sr = new StreamReader(resp.GetResponseStream());
            string sourceCode = sr.ReadToEnd();
            sr.Close();
            resp.Close();
            return sourceCode;
        }
........
string url = txbURL2.Text;

string sourceCode = WorkerClass.getSourceCode(url);
int startIndex = sourceCode.IndexOf("<td valign=\"top\" width=\"100%\">");
sourceCode = sourceCode.Substring(startIndex, sourceCode.Length - startIndex);
........
//Gets Lyric
startIndex = sourceCode.IndexOf("<br><b>Lyrics:</b><br><br>") + 30;
endIndex = sourceCode.IndexOf("     <br><br>", startIndex);
string lyric = sourceCode.Substring(startIndex, endIndex - startIndex) + "";
rtbLyric.Text = lyric;
//End Lyric

Answer 1

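The problem is probably character encoding. My guess is that the page you are scraping is encoded in UTF-8, but somewhere along the line you are converting it to ASCII.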
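As a rough illustration of why a question mark shows up, the sketch below pushes a string through ASCII and back. It assumes the lyrics use a typographic apostrophe (U+2019) rather than a plain `'`; that is a guess about the scraped page, not something the question confirms. `EncodingDemo` and `RoundTripThroughAscii` are hypothetical names used only for this example.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Encodes the text to ASCII bytes and decodes it again.
    // Characters that ASCII cannot represent are replaced with '?'.
    public static string RoundTripThroughAscii(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }

    public static void Main()
    {
        // Assumption: the page uses U+2019 (right single quotation mark).
        string original = "I\u2019m";
        Console.WriteLine(RoundTripThroughAscii(original)); // prints "I?m"
    }
}
```

Any step that narrows the text to an encoding without that character (reading, storing, or displaying) can produce the same symptom, so it is worth checking each stage.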

See the excellent article "What every developer should know about character encoding" for more details.

Update

You could try this, although StreamReader should default to UTF-8 anyway:

var encoding = System.Text.Encoding.GetEncoding("utf-8");
StreamReader sr = new StreamReader(resp.GetResponseStream(), encoding);
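If the page turns out not to be UTF-8, hardcoding the encoding will not help. A possible variation, sketched below, asks the response which charset the server declared and falls back to UTF-8 otherwise. `WorkerClassSketch` and `ResolveEncoding` are hypothetical names; what `HttpWebResponse.CharacterSet` returns when the header has no charset differs between runtimes, and on .NET Core and later some legacy code pages are only available after registering an extra encoding provider, so treat this as a starting point rather than a definitive fix.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class WorkerClassSketch
{
    // Maps a charset name to an Encoding.
    // Falls back to UTF-8 when the name is missing or not recognised.
    public static Encoding ResolveEncoding(string charsetName)
    {
        if (string.IsNullOrWhiteSpace(charsetName))
            return Encoding.UTF8;
        try
        {
            return Encoding.GetEncoding(charsetName.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Downloads the page and decodes it with the charset the server declared.
    public static string GetSourceCode(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        using (Stream stream = resp.GetResponseStream())
        using (StreamReader sr = new StreamReader(stream, ResolveEncoding(resp.CharacterSet)))
        {
            return sr.ReadToEnd();
        }
    }
}
```

The `using` blocks also close the reader and the response if `ReadToEnd` throws, which the explicit `Close()` calls in the question would not.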

Answer 2

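Check the encoding by searching the HTML source for charset. Your code snippet leaves out the actual loading step, so it is impossible to tell where it goes wrong.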
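One way to do that search programmatically is a small regular expression over the downloaded markup. The sketch below is only an approximation: `CharsetSniffer` and `FindCharset` are hypothetical names, the sample `<meta>` tag is made up, and a regex will not cover every way a page can declare its encoding.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Returns the first charset value found in the markup,
    // or null when no charset declaration is present.
    public static string FindCharset(string html)
    {
        Match m = Regex.Match(
            html,
            @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Made-up sample; the real page's declaration may differ.
        string sample = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">";
        Console.WriteLine(FindCharset(sample)); // prints "ISO-8859-1"
    }
}
```

Note that the markup has to be decoded before it can be searched, so this works best as a check on what the page claims, compared against the encoding actually used to read it.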

Answer 3

You could also try using WebClient:

WebClient client = new WebClient { Encoding = Encoding.UTF8 };
string html = client.DownloadString(url);

Web scraper replaces some characters with question marks
