Question

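I'm trying to scrape a web page to get its text. I put each word into a dictionary and count how many times each word appears on the page. Following the advice in this post, I tried using HTML Agility Pack: How to get number of words on a web page?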

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
int wordCount = 0;
Dictionary<string, int> dict = new Dictionary<string, int>();

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    MatchCollection matches = Regex.Matches(node.InnerText, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);
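    foreach (Match s in matches)
    {
        // Add the entry to the dictionary (lower-casing the key is an assumption,
        // so that "Width" and "width" share one count)
        string word = s.Value.ToLower();
        dict[word] = dict.TryGetValue(word, out int n) ? n + 1 : 1;
        wordCount++;
    }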
}

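However, with my current implementation I still get a lot of results from the markup that shouldn't be counted. It's close, but not quite there (I don't expect it to be perfect).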

I'm using this page as an example. My results show many occurrences of the words "width" and "googletag", even though those don't appear in the actual text of the page at all.

Any suggestions on how to fix this? Thanks!

Answer 1

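You can't be certain whether the word you're searching for is actually displayed to the user, because JavaScript execution and CSS rules affect what the user ends up seeing.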

The following program finds 0 matches for "width" and "googletag", but it finds 126 matches for "html", whereas Ctrl+F in Chrome finds 106 matches.

Note that the program does not match the word if the node's parent is a <script> element.

using HtmlAgilityPack;
using System;

namespace WordCounter
{
    class Program
    {
        private static readonly Uri Uri = new Uri("https://www.w3schools.com/html/html_editors.asp");

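        static void Main(string[] args)
        {
            var doc = new HtmlWeb().Load(Uri);
            // Only consider nodes inside <body>, which excludes everything in <head>.
            var nodes = doc.DocumentNode.SelectSingleNode("//body").DescendantsAndSelf();
            // ReadLine returns null at end of input, so guard against it.
            var word = Console.ReadLine()?.ToLower();
            while (word != null && word != "exit")
            {
                // Counts the text nodes that contain the word, not the total number of occurrences.
                var count = 0;
                foreach (var node in nodes)
                {
                    // Skip non-text nodes and any text whose parent is a <script> element.
                    if (node.NodeType == HtmlNodeType.Text && node.ParentNode.Name != "script" && node.InnerText.ToLower().Contains(word))
                    {
                        count++;
                    }
                }

                Console.WriteLine($"{word} is displayed {count} times.");
                word = Console.ReadLine()?.ToLower();
            }
        }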
    }
}
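
To get the per-word dictionary the question asks for, the same parent-node filter can be combined with the question's regex. The following is only a minimal sketch under some assumptions: the names `VisibleWordCounter`, `CountWords` and `SkippedParents` are hypothetical, the list of skipped elements is a guess rather than a complete rule, and the HTML is passed in as a string so the example does not depend on a live page. Like the program above, it cannot know what JavaScript or CSS will hide or add.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace WordCounter
{
    public static class VisibleWordCounter
    {
        // Elements whose text is not page content (assumed list, not exhaustive).
        private static readonly HashSet<string> SkippedParents =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style", "noscript" };

        // Same pattern as in the question: words of two or more letters, plus "a" and "i".
        private static readonly Regex WordPattern =
            new Regex(@"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);

        public static Dictionary<string, int> CountWords(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var counts = new Dictionary<string, int>();
            // Fall back to the whole document when the markup has no <body>.
            var root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

            foreach (var node in root.Descendants())
            {
                // Keep only text nodes whose parent is not one of the skipped elements.
                if (node.NodeType != HtmlNodeType.Text || SkippedParents.Contains(node.ParentNode.Name))
                    continue;

                // Decode entities such as &quot; first, so "quot" is not counted as a word.
                var text = HtmlEntity.DeEntitize(node.InnerText);
                foreach (Match match in WordPattern.Matches(text))
                {
                    var word = match.Value.ToLowerInvariant();
                    counts[word] = counts.TryGetValue(word, out var n) ? n + 1 : 1;
                }
            }
            return counts;
        }

        public static void Main()
        {
            // Made-up sample markup: "width" appears twice in text and once inside a script.
            const string html =
                "<html><body><p>Width matters. A width test.</p><script>var width = 1;</script></body></html>";
            foreach (var pair in CountWords(html).OrderByDescending(p => p.Value))
                Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```

With the sample markup, "width" should be counted for the two occurrences in the paragraph and not for the one inside the script. Unlike the program above, this counts every occurrence rather than the number of text nodes containing the word, so its totals are not directly comparable to the 126 figure.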

Getting only the text of a web page with HTML Agility Pack?
