Hi everyone. I have the string <span class="lnk">Участники <span class="clgry">59728</span></span>
and I parse it like this:
string population = Regex.Match(content, @"Участники <span class=""clgry"">(?<id>[^""]+?)</span>").Groups["id"].Value;
int j = 0;
if (!string.IsNullOrEmpty(population))
{
    log("[+] Группа: " + group + " Учасники: " + population + "\r\n");
    int population_int = Convert.ToInt32(population);
    if (population_int > 20000)
    {
        lock (accslocker)
        {
            StreamWriter file = new StreamWriter("opened.txt", true);
            file.Write(group + ":" + population + "\r\n");
            file.Close();
        }
        j++;
    }
}
But when my string is <span class="lnk">Участники <span class="clgry"></span></span>
I get the exception "Input string was not in a correct format".
How can I avoid this?
Answer 0: (score: 2)
Instead of a regex, use a real HTML parser to parse HTML (for example, HtmlAgilityPack).
string html = @"<span class=""lnk"">Участники <span class=""clgry"">59728</span>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var list = doc.DocumentNode.SelectNodes("//span[@class='lnk']/span[@class='clgry']")
    .Select(x => new
    {
        ParentText = x.ParentNode.FirstChild.InnerText,
        Text = x.InnerText
    })
    .ToList();
Answer 1: (score: 1)
Trying to parse HTML content with regular expressions is not a good decision. See this. Use Html Agility Pack instead.
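var spans = doc.DocumentNode.Descendants("span")
    // GetAttributeValue returns the default ("") when a span has no class
    // attribute, whereas s.Attributes["class"].Value would throw on null.
    .Where(s => s.GetAttributeValue("class", "") == "clgry")
    .Select(x => x.InnerText)
    .ToList();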
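
Whichever way the text is extracted, the conversion step still needs a guard. With the empty span, InnerText is an empty string. In the question's regex, [^""]+? ends up capturing the inner "</span>" itself, so the IsNullOrEmpty check passes. Convert.ToInt32 throws FormatException on both values. A minimal sketch of a safer conversion follows; PopulationParser and TryParsePopulation are hypothetical names introduced only for illustration and are not part of the original code:

```csharp
using System;
using System.Globalization;

public static class PopulationParser
{
    // Hypothetical helper: returns true and sets 'population' only when
    // 'text' consists solely of decimal digits that fit into an int.
    public static bool TryParsePopulation(string text, out int population)
    {
        // NumberStyles.None accepts digits only (no sign, no separators,
        // no surrounding whitespace), so trim first. A null input becomes
        // a null argument, for which int.TryParse returns false.
        return int.TryParse(
            text?.Trim(),
            NumberStyles.None,
            CultureInfo.InvariantCulture,
            out population);
    }
}
```

In the question's code this would replace the Convert.ToInt32 call, for example: if (PopulationParser.TryParsePopulation(population, out int population_int) && population_int > 20000) { ... }. Groups with a missing or malformed count are then skipped instead of raising an exception.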