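I'm trying to parse an IMDb page with regular expressions (I know HAP is better), but my regex is wrong, so perhaps you can also suggest how to do this properly with HAP.
Here is the part of the page I'm trying to parse. I need to pull 2 numbers out of it: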
<small>5 out of 5 people found the following review useful:</small>
<br>
<a href="/user/ur1174211/">
<h2>Interesting, Particularly in Comparison With "La Sortie des usines Lumière"</h2>
<b>Author:</b>
<a href="/user/ur1174211/">Snow Leopard</a>
<small>from Ohio</small>
<br>
<small>10 March 2005</small>
This is my code in C#:
Regex reg1 = new Regex("([0-9]+(out of)+[0-9])");
for (int i = 0; i < number; i++)
{
    Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
    Match m = reg1.Match(header[i].InnerHtml);
    if (!m.Success)
    {
        return;
    }
    else
    {
        string str1 = m.Value.Split(' ')[0];
        string str2 = m.Value.Split(' ')[3];
        if (!Int32.TryParse(str1, out index1))
        {
            return;
        }
        if (!Int32.TryParse(str2, out index2))
        {
            return;
        }
        Console.WriteLine("index1 = {0}", index1);
        Console.WriteLine("index2 = {0}", index2);
    }
}
Many thanks to everyone who reads this.
Answer 0 (score: 2)
Try this. This way you can take more than just the numbers.
// \d+ (rather than \d*) requires at least one digit on each side of "out of".
Regex reg1 = new Regex(@"(\d+ (out of) \d+)");
for (int i = 0; i < number; i++)
{
    Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
    Match m = reg1.Match(header[i].InnerHtml);
    if (!m.Success)
    {
        return;
    }
    else
    {
        // A second pass over the matched text picks out the two numbers in order.
        Regex reg2 = new Regex(@"\d+");
        m = reg2.Match(m.Value);
        string str1 = m.Value;
        string str2 = m.NextMatch().Value;
        if (!Int32.TryParse(str1, out index1))
        {
            return;
        }
        if (!Int32.TryParse(str2, out index2))
        {
            return;
        }
        Console.WriteLine("index1 = {0}", index1);
        Console.WriteLine("index2 = {0}", index2);
    }
}
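As a possible variation on the code above, both numbers can be read in a single pass by putting each one in its own capture group. This is only a minimal sketch: the class name ReviewVotes and the method TryParseVotes are hypothetical names made up for the illustration, and the input string is the small tag from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReviewVotes
{
    // Group 1 captures the first number, group 2 the second.
    private static readonly Regex VotePattern = new Regex(@"(\d+) out of (\d+)");

    // Hypothetical helper: returns true and fills both values when the
    // text contains "<number> out of <number>".
    public static bool TryParseVotes(string text, out int useful, out int total)
    {
        useful = 0;
        total = 0;
        Match m = VotePattern.Match(text);
        return m.Success
            && int.TryParse(m.Groups[1].Value, out useful)
            && int.TryParse(m.Groups[2].Value, out total);
    }

    public static void Main()
    {
        string html = "<small>5 out of 5 people found the following review useful:</small>";
        int useful, total;
        if (TryParseVotes(html, out useful, out total))
        {
            Console.WriteLine("index1 = {0}", useful);
            Console.WriteLine("index2 = {0}", total);
        }
    }
}
```

With capture groups there is no need for a second regex or for splitting the match on spaces, so multi-digit counts such as "12 out of 130" are handled the same way.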
Answer 1 (score: 0)
If you already have the InnerHtml of the small tag, you can also get the numbers like this:
var title = "5 out of 5 people found the following review useful:";
var titleNumbers = title.ToCharArray().Where(x => Char.IsNumber(x));
Edit
As @PulseLab suggested, here is another approach:
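var sd = title.Split(' ').Where(data =>
{
    // Keep only the tokens that parse as integers. Testing the TryParse
    // result (rather than datum > 0) also keeps a count of 0.
    int datum;
    return int.TryParse(data, out datum);
}).ToArray();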
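Since the question also asks about HAP, here is a rough sketch of how the small tags might be reached with HtmlAgilityPack before the numbers are extracted. It assumes the HtmlAgilityPack NuGet package is installed and that the page HTML has already been downloaded into a string; the class ReviewScraper and the method ExtractVotes are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class ReviewScraper
{
    private static readonly Regex VotePattern = new Regex(@"(\d+) out of (\d+)");

    // Hypothetical helper: returns one (useful, total) pair per <small>
    // element whose text contains "<number> out of <number>".
    public static List<Tuple<int, int>> ExtractVotes(string html)
    {
        var results = new List<Tuple<int, int>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection smallNodes = doc.DocumentNode.SelectNodes("//small");
        if (smallNodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in smallNodes)
        {
            Match m = VotePattern.Match(node.InnerText);
            int useful, total;
            if (m.Success
                && int.TryParse(m.Groups[1].Value, out useful)
                && int.TryParse(m.Groups[2].Value, out total))
            {
                results.Add(Tuple.Create(useful, total));
            }
        }
        return results;
    }
}
```

Other small tags on the page, such as the author's location or the review date, do not match the pattern and are skipped, so only the vote line produces a pair.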