Question

I'm trying to scrape a website that has a "pre" tag using the HTML Agility Pack in C#. I can find plenty of "table tr td" examples but cannot find any "pre" examples. Here is my code, with a sample of the preformatted text from the "pre" tag included inline as comments.

    private void PreformattedTextButton_Click(object sender, EventArgs e)
    {
        var url = @"http://www.thepredictiontracker.com/basepred.php";
        var data = new MyWebClient().DownloadString(url);
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(data);

        //            m _        a _        e d     d d     d d     d l     n
        //e       h d       v r    1     2     3     4     5     6     2     s

        //  BAL D.BUNDY TAM C.ARCHER     7.5  7.48  8.08  7.00  5.58  4.70.     .    6.46
        //  CIN H.BAILEY ATL S.NEWCOMB    9.0  9.72 10.08 10.00 11.62 11.51.     .   10.73

        foreach (HtmlNode pre in doc.DocumentNode.SelectNodes("//pre"))
        {
            textBox1.Text += pre.InnerText + System.Environment.NewLine;
        }
    }

I want to capture the lines that look like the 3rd and 4th comment lines above, ignoring the preceding lines.

The foreach executes, but pre.InnerText.Length is 1642, which is the length of the entire preformatted text. I want to capture individual lines of data, e.g. lines 3 and 4.

Answer 1

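By definition, the <pre> tag holds preformatted text, so you need to parse the InnerText property yourself. The sample you provided above is consistently formatted, so split InnerText into a set of lines and then use a Regex to capture the lines you want. A tested and working code example: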

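using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

var url = @"http://www.thepredictiontracker.com/basepred.php";
HtmlDocument doc = new HtmlWeb().Load(url);
// Matches lines that start with three uppercase letters, an initial and
// surname such as "D.BUNDY", and then three more uppercase letters.
var regexMatch = new Regex(
    @"^\s*[A-Z]{3}\s+[A-Z]\.[A-Z]+\s+[A-Z]{3}",
    RegexOptions.Compiled
);
// SelectNodes returns null when nothing matches, so guard before iterating.
var preNodes = doc.DocumentNode.SelectNodes("//pre");
if (preNodes != null)
{
    foreach (HtmlNode pre in preNodes)
    {
        // Split the preformatted block into lines and keep only the data lines.
        foreach (var line in pre.InnerText.Split(new char[] { '\r', '\n' }))
        {
            if (regexMatch.IsMatch(line))
            {
                Console.WriteLine(line.Trim());
            }
        }
    }
}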
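
If you need the individual fields rather than the whole line, one option is to add named groups to the same pattern. The following is only a minimal sketch: it assumes every data line has the shape of the two sample lines in the question, and the class name PreLineParser and the group names (team1, pitcher1, team2, pitcher2, rest) are made up for illustration. It runs on a hard-coded sample string rather than the live page, so it does not need the HTML Agility Pack; in practice you would pass pre.InnerText to Parse.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PreLineParser
{
    // Same shape as the pattern above, extended with named groups.
    // Assumption: names look like "D.BUNDY" (initial, dot, uppercase surname).
    private static readonly Regex DataLine = new Regex(
        @"^\s*(?<team1>[A-Z]{3})\s+(?<pitcher1>[A-Z]\.[A-Z]+)\s+" +
        @"(?<team2>[A-Z]{3})\s+(?<pitcher2>[A-Z]\.[A-Z]+)\s+(?<rest>.*)$");

    // Returns one string array per matching line:
    // { team1, pitcher1, team2, pitcher2, remaining columns as raw text }.
    public static List<string[]> Parse(string preText)
    {
        var rows = new List<string[]>();
        var lines = preText.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var line in lines)
        {
            Match m = DataLine.Match(line);
            if (!m.Success)
            {
                continue; // header or other non-data line
            }

            rows.Add(new[]
            {
                m.Groups["team1"].Value,
                m.Groups["pitcher1"].Value,
                m.Groups["team2"].Value,
                m.Groups["pitcher2"].Value,
                m.Groups["rest"].Value.Trim()
            });
        }
        return rows;
    }

    public static void Main()
    {
        // Hard-coded stand-in for pre.InnerText (shortened sample lines).
        string sample =
            "some header line\n" +
            "  BAL D.BUNDY TAM C.ARCHER     7.5  7.48\n" +
            "  CIN H.BAILEY ATL S.NEWCOMB    9.0  9.72\n";

        foreach (var row in Parse(sample))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

The numeric columns are left as raw text in the last element on purpose: the sample lines use a bare "." where a value is missing, so splitting them further is safer once you have checked how the real page lays out its columns.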
