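I'd like to use .NET's WebClient class to download a web page, extract the title (i.e. the content between <title> and </title>), and save the page to a file.
The problem is that the page is encoded in UTF-8, and System.IO.StreamWriter throws an exception when the file name contains such characters.
I googled and tried several ways of converting UTF-8 to ANSI, to no avail. Does anyone have working code for this?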
    'Using WebClient asynchronous downloading
    Private Sub AlertStringDownloaded(ByVal sender As Object,
                                      ByVal e As DownloadStringCompletedEventArgs)
        If e.Cancelled = False AndAlso e.Error Is Nothing Then
            Dim Response As String = CStr(e.Result)

            'Doesn't work
            Dim resbytes() As Byte = Encoding.UTF8.GetBytes(Response)
            Response = Encoding.Default.GetString(Encoding.Convert(Encoding.UTF8,
                                                                   Encoding.Default, resbytes))

            Dim title As Regex = New Regex("<title>(.+?) \(",
                                           RegexOptions.Singleline)
            Dim m As Match
            m = title.Match(Response)
            If m.Success Then
                Dim MyTitle As String = m.Groups(1).Value

                'Illegal characters in path.
                Dim objWriter As New System.IO.StreamWriter("c:\" & MyTitle & ".txt")
                objWriter.Write(Response)
                objWriter.Close()
            End If
        End If
    End Sub
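Edit: Thanks, everyone, for the help. It turns out the error wasn't caused by UTF-8 but by an LF character hidden in the page's title section, which is apparently an illegal character in a path.
Edit: Here's a simple way to remove some illegal characters from a file name/path: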
    Dim MyTitle As String = m.Groups(1).Value
    Dim InvalidChars As String = New String(Path.GetInvalidFileNameChars()) + New String(Path.GetInvalidPathChars())
    For Each c As Char In InvalidChars
        MyTitle = MyTitle.Replace(c.ToString(), "")
    Next
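For anyone reusing the loop above, here is a minimal sketch that wraps the same idea in a helper and writes the page out as UTF-8 explicitly. `SanitizeFileName`, `SavePage`, and the "untitled" fallback are hypothetical names and choices made for illustration, not part of the original code. On Windows, Path.GetInvalidFileNameChars() includes the control characters (LF among them) as well as < > : " / \ | ? *, so a stray line feed in the title is removed too.

```vbnet
Imports System.IO
Imports System.Text

Module TitleFileHelper
    ' Hypothetical helper: drops every character the OS rejects in a file name.
    Function SanitizeFileName(ByVal rawTitle As String) As String
        Dim invalid() As Char = Path.GetInvalidFileNameChars()
        Dim sb As New StringBuilder()
        For Each c As Char In rawTitle
            ' Keep the character only if it is not in the invalid set.
            If Array.IndexOf(invalid, c) < 0 Then sb.Append(c)
        Next
        Dim cleaned As String = sb.ToString().Trim()
        ' Assumption: fall back to a fixed name when nothing usable is left.
        If cleaned.Length = 0 Then cleaned = "untitled"
        Return cleaned
    End Function

    ' Hypothetical helper: saves the page text as UTF-8 under the sanitized title.
    Sub SavePage(ByVal folder As String, ByVal rawTitle As String, ByVal html As String)
        Dim target As String = Path.Combine(folder, SanitizeFileName(rawTitle) & ".txt")
        File.WriteAllText(target, html, Encoding.UTF8)
    End Sub
End Module
```

Inside the download handler this would be called as `SavePage("c:\", MyTitle, Response)`. Only the file name is sanitized here; the folder is passed separately so that its backslashes are not stripped.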
Edit: Here's how to tell WebClient to expect UTF-8:
    Dim webClient As New WebClient
    AddHandler webClient.DownloadStringCompleted, AddressOf AlertStringDownloaded
    webClient.Encoding = Encoding.UTF8
    'The URI needs a scheme; New Uri("www.acme.com") throws UriFormatException
    webClient.DownloadStringAsync(New Uri("http://www.acme.com"))
Answer 0 (score: 1)
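I don't think this problem has anything to do with UTF-8. I think your regex will include the </title> if it appears on the same line. The characters < and > are not valid in Windows file names.
If that's not the problem, it would help to see some sample input and output values of MyTitle.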
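To illustrate the point about the regex, here is a rough sketch of a pattern that stops at the closing tag, so the captured group can never contain </title> or other markup. `ExtractTitle` is a hypothetical helper name. Note that this deliberately differs from the pattern in the question, which cuts the title off at the first " (". The whitespace collapsing is an added assumption aimed at line breaks inside the title.

```vbnet
Imports System.Text.RegularExpressions

Module TitleExtractor
    ' Hypothetical helper: returns the text between <title> and </title>,
    ' or Nothing when the page has no title element.
    Function ExtractTitle(ByVal html As String) As String
        ' The lazy group ends at the first closing tag, so no markup is captured.
        Dim pattern As New Regex("<title[^>]*>(.*?)</title>",
                                 RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        Dim m As Match = pattern.Match(html)
        If Not m.Success Then Return Nothing
        ' Collapse CR, LF, tabs and runs of spaces into single spaces.
        Return Regex.Replace(m.Groups(1).Value, "\s+", " ").Trim()
    End Function
End Module
```

A title can still contain characters such as : or ? that Windows rejects, so the result would still need the invalid-character cleanup before being used as a file name.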