Question

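I wrote a program that crawls websites for data and outputs it to an Excel sheet. The program is written in C# using Microsoft Visual Studio 2010.

For the most part, fetching the content from a website, parsing it, and storing the data in Excel all work without problems.

However, once in a while I run into an error saying that illegal characters (such as ▶) are preventing output to the Excel file, and this crashes the program. I also visited the websites manually and found other illegal characters, such as Ú.

I tried .Replace(), but the code doesn't seem to find these characters.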

string htmlContent = getResponse(url); //get full html from given url
string newHtml = htmlContent.Replace("▶", "?").Replace("Ú", "?");

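So my question is: is there a way to remove all characters of this kind from an HTML string (the HTML of a web page)? Below is the error message I received.

I tried Anthony's and woz's solutions, and neither worked...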

[image: screenshot of the error message, not preserved]

Answer 1

See System.Text.Encoding.Convert.

Example usage:

var htmlText = "▶Hello World"; // replace with the text you're trying to convert

var convertedText = System.Text.Encoding.ASCII.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.Unicode,
        System.Text.Encoding.ASCII,
        System.Text.Encoding.Unicode.GetBytes(htmlText)));

I tested it with the string ▶Hello World, and it gave me ?Hello World.

Answer 2

You can try stripping out all non-ASCII characters. The example below replaces each one with a question mark:

string htmlContent = getResponse(url);
string newHtml = Regex.Replace(htmlContent, @"[^\u0000-\u007F]", "?");

Answer 3

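Thanks for the replies, and thank you for the help.

After a few hours of googling, I found a solution to my problem. The issue was that I had to "sanitize" my HTML string.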

http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/

The link above is the helpful article I found; it also provides a code sample.
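For readers who cannot reach the article, here is a minimal sketch of what such sanitizing might look like. It is not the article's code. It assumes the error was the XML "hexadecimal value ... is an invalid character" exception that the article's URL hints at, and that the goal is to drop every character outside the XML 1.0 Char range. The names XmlSanitizer, IsLegalXmlChar, and Sanitize are made up for illustration.

```csharp
using System;
using System.Text;

public static class XmlSanitizer
{
    // True if a single UTF-16 code unit is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    public static bool IsLegalXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns a copy of input with characters that are illegal in XML 1.0 removed.
    public static string Sanitize(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A valid surrogate pair encodes U+10000..U+10FFFF, which XML allows.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (IsLegalXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

Usage would be something like string cleanHtml = XmlSanitizer.Sanitize(htmlContent); before writing to the sheet. Note that ▶ and Ú are themselves legal XML characters, so this sketch keeps them; it only removes things like 0x00 and other control characters that may be hiding in the scraped HTML.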

Removing illegal characters from an Excel sheet
