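I have a method that downloads a web page and extracts the title tag, but depending on the site, the result can come back HTML-encoded or decoded with the wrong character set. Is there a bulletproof way to get a site's title regardless of its encoding?
Some URLs I tested gave different results:
The method I'm using: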
private string GetUrlTitle(Uri uri)
{
    string title = "";
    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = null;
        response = client.GetAsync(uri).Result;
        if (!response.IsSuccessStatusCode)
        {
            string errorMessage = "";
            try
            {
                XmlSerializer xml = new XmlSerializer(typeof(HttpError));
                HttpError error = xml.Deserialize(response.Content.ReadAsStreamAsync().Result) as HttpError;
                errorMessage = error.Message;
            }
            catch (Exception)
            {
                errorMessage = response.ReasonPhrase;
            }
            throw new Exception(errorMessage);
        }
        var html = response.Content.ReadAsStringAsync().Result;
        title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    }
    if (title == string.Empty)
    {
        title = uri.ToString();
    }
    return title;
}
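Answer 0 (score: 0)
The charset is not always present in the response header, so we also have to check the meta tags, and fall back to UTF8 (or something else) if it is not there either. In addition, the title may be HTML-encoded, so we just need to decode it.
Results
The code below comes from the GitHub project Abot. I modified it a little.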
// Requires: System, System.IO, System.Net, System.Net.Http, System.Text, System.Text.RegularExpressions
private string GetUrlTitle(Uri uri)
{
    string title = "";
    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = client.GetAsync(uri).Result;
        if (!response.IsSuccessStatusCode)
        {
            throw new Exception(response.ReasonPhrase);
        }
        var contentStream = response.Content.ReadAsStreamAsync().Result;
        // ContentType is null when the server sends no Content-Type header, hence the ?. operator;
        // if the header carries no charset, look for one in the body instead
        var charset = response.Content.Headers.ContentType?.CharSet ?? GetCharsetFromBody(contentStream);
        Encoding encoding = GetEncodingOrDefaultToUTF8(charset);
        string content = GetContent(contentStream, encoding);
        Match titleMatch = Regex.Match(content, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase);
        if (titleMatch.Success)
        {
            title = titleMatch.Groups["Title"].Value;
            // decode the title in case it has been HTML-encoded
            title = WebUtility.HtmlDecode(title).Trim();
        }
    }
    // fall back to the URL itself when no usable title was found
    if (string.IsNullOrWhiteSpace(title))
    {
        title = uri.ToString();
    }
    return title;
}
private string GetContent(Stream contentStream, Encoding encoding)
{
    // rewind, because the stream may already have been read while looking for the charset
    contentStream.Seek(0, SeekOrigin.Begin);
    using (StreamReader sr = new StreamReader(contentStream, encoding))
    {
        return sr.ReadToEnd();
    }
}
/// <summary>
/// Try getting the charset from the body content.
/// </summary>
/// <param name="contentStream"></param>
/// <returns></returns>
private string GetCharsetFromBody(Stream contentStream)
{
    contentStream.Seek(0, SeekOrigin.Begin);
    // the reader is deliberately not disposed: disposing it would close the stream,
    // which GetContent still has to read afterwards
    StreamReader srr = new StreamReader(contentStream, Encoding.ASCII);
    string body = srr.ReadToEnd();
    string charset = null;
    if (body != null)
    {
        // expression taken from: http://stackoverflow.com/questions/3458217/how-to-use-regular-expression-to-match-the-charset-string-in-html
        Match match = Regex.Match(body, @"<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s""']*)?([^>]*?)[\s""';]*charset\s*=[\s""']*([^\s""'/>]*)", RegexOptions.IgnoreCase);
        if (match.Success)
        {
            charset = string.IsNullOrWhiteSpace(match.Groups[2].Value) ? null : match.Groups[2].Value;
        }
    }
    return charset;
}
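/// <summary>
/// Try parsing the charset, or fall back to UTF8
/// </summary>
/// <param name="charset"></param>
/// <returns></returns>
private Encoding GetEncodingOrDefaultToUTF8(string charset)
{
    Encoding e = Encoding.UTF8;
    if (charset != null)
    {
        try
        {
            // some servers quote the value (charset="utf-8"), so strip the quotes first
            e = Encoding.GetEncoding(charset.Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // unknown or unsupported charset name: keep the UTF8 default
        }
    }
    return e;
}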
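To try the fallback order described above (header charset, then meta tag, then UTF8) without touching the network, here is a minimal sketch that works on an in-memory byte array. The names `TitleDemo`, `ResolveEncoding`, and `ExtractTitle` are hypothetical and not part of Abot, and the meta-tag regex is a simplified stand-in for the one used above, so treat it as an illustration rather than a drop-in replacement.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class TitleDemo
{
    // Hypothetical helper: use the header charset if one is given, otherwise look for
    // a <meta> charset in the body, otherwise fall back to UTF-8.
    public static Encoding ResolveEncoding(string headerCharset, byte[] body)
    {
        string charset = headerCharset;
        if (string.IsNullOrWhiteSpace(charset))
        {
            // ISO-8859-1 maps every byte to exactly one char, so the ASCII-only
            // meta tag can be read before the real encoding is known.
            string probe = Encoding.GetEncoding("iso-8859-1").GetString(body);
            Match m = Regex.Match(probe, @"<meta[^>]*charset\s*=\s*[""']?([^\s""'/>;]+)", RegexOptions.IgnoreCase);
            if (m.Success) charset = m.Groups[1].Value;
        }
        if (string.IsNullOrWhiteSpace(charset)) return Encoding.UTF8;
        try { return Encoding.GetEncoding(charset.Trim('"', '\'', ' ')); }
        catch (ArgumentException) { return Encoding.UTF8; } // unknown charset name
    }

    // Decodes the bytes with the resolved encoding, then extracts and HTML-decodes the title.
    public static string ExtractTitle(string headerCharset, byte[] body)
    {
        string html = ResolveEncoding(headerCharset, body).GetString(body);
        Match m = Regex.Match(html, @"<title\b[^>]*>\s*([\s\S]*?)</title>", RegexOptions.IgnoreCase);
        return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value).Trim() : "";
    }

    public static void Main()
    {
        string page = "<html><head><meta charset=\"iso-8859-1\"><title>Caf\u00e9 &amp; Bar</title></head></html>";
        byte[] latin1Bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(page);

        // No header charset: the meta tag is used and the title decodes cleanly.
        Console.WriteLine(ExtractTitle(null, latin1Bytes));
        // A wrong header charset wins over the meta tag and garbles the accented character.
        Console.WriteLine(ExtractTitle("utf-8", latin1Bytes));
    }
}
```

One thing to watch for: on .NET Core and .NET 5+, legacy code pages such as windows-1252 are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`. Without that call, `Encoding.GetEncoding` throws for those names and the code silently falls back to UTF8.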
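Answer 1 (score: -1)
You can try fetching all the bytes and converting them to a string with whatever encoding you want, simply by using the Encoding class. It would look something like this: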
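// await requires an async method, so the return type becomes Task<string>
private async Task<string> GetUrlTitle(Uri uri)
{
    string title = "";
    using (HttpClient client = new HttpClient())
    {
        // download the raw bytes so the decoding step is under our control
        byte[] byteData = await client.GetByteArrayAsync(uri);
        // UTF8 is assumed here; substitute the page's actual encoding if it differs
        string html = Encoding.UTF8.GetString(byteData);
        title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    }
    return title;
}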
I hope this helps; if it does, please mark it as the answer.
Answer 2 (score: -3)
This may help you. Use globalization:
using System;
using System.Globalization;

public class Example
{
    public static void Main()
    {
        string[] values = { "a tale of two cities", "gROWL to the rescue",
                            "inside the US government", "sports and MLB baseball",
                            "The Return of Sherlock Holmes", "UNICEF and children"};
        TextInfo ti = CultureInfo.CurrentCulture.TextInfo;
        foreach (var value in values)
            Console.WriteLine("{0} --> {1}", value, ti.ToTitleCase(value));
    }
}
Check this out: https://msdn.microsoft.com/en-us/library/system.globalization.textinfo.totitlecase(v=vs.110).aspx