Extracting meaningful words from text

Time: 2015-04-13 23:34:01

Tags: c# html search-engine token

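I want to extract all meaningful words from the content of an HTML page for tokenization, using the regular expression functions in C#. This is what I have so far, but I still get junk. How can I do this properly?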

    //Remove Html tags
    content = Regex.Replace(content, @"<.*?>", " ");

    //Decode Html characters
    content = HttpUtility.HtmlDecode(content);

    //Remove everything but letters, numbers and whitespace characters
    content = Regex.Replace(content, @"[^\w\s]", string.Empty);

    //Remove multiple whitespace characters
    content = Regex.Replace(content, @"\s+", " ");

    //remove any digits
    content = Regex.Replace(content, @"[\d-]"," ");

    //remove words less than 2 and more than 20 length
    content = Regex.Replace(content, @"\b\w{2,20}\b", string.Empty);

1 answer:

Answer 0 (score: 1)

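Using regular expressions to process HTML is usually more trouble than it's worth. Grab HtmlAgilityPack and use it to walk the HTML DOM, extracting whatever is inside the text nodes. You can use something like the class below to collect all the text blocks in an HTML string.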

using System.Collections.Generic;
using HtmlAgilityPack;

// Collects the non-empty text nodes of an HTML document.
public sealed class HtmlTextExtractor
{
    private readonly string m_html;

    public HtmlTextExtractor(string html)
    {
        m_html = html;
    }

    public IEnumerable<string> GetTextBlocks()
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(m_html);

        var text = new List<string>();
        WalkNode(htmlDocument.DocumentNode, text);

        return text;
    }

    // Recursively visits the node tree and appends the decoded content of each text node.
    private void WalkNode(HtmlNode node, List<string> text)
    {
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                break; // Exclude comments?

            case HtmlNodeType.Document:
            case HtmlNodeType.Element:
                {
                    if (node.HasChildNodes)
                    {
                        foreach (var childNode in node.ChildNodes)
                            WalkNode(childNode, text);
                    }
                }
                break;

            case HtmlNodeType.Text:
                {
                    var html = ((HtmlTextNode)node).Text;
                    if (html.Length <= 0)
                        break;

                    // Decode HTML entities and drop surrounding whitespace.
                    var cleanHtml = HtmlEntity.DeEntitize(html).Trim();
                    if (!string.IsNullOrEmpty(cleanHtml))
                        text.Add(cleanHtml);
                }
                break;
        }
    }
}

Then you can focus on splitting/tokenizing the text.

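// Requires: using System.Collections.Generic; using System.Linq;
var extractor = new HtmlTextExtractor(html);
var textBlocks = extractor.GetTextBlocks();

// Split every text block on the space character only.
var words = new List<string>();
foreach (var textBlock in textBlocks)
{
    words.AddRange(textBlock.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
}

// Clean each word, keep those with 3 to 19 characters, then de-duplicate and sort.
// The length check already excludes null-free empty strings, so no separate emptiness test is needed.
var distinctWords = words.Select(word => CleanWord(word))
    .Where(word => word.Length > 2 && word.Length < 20)
    .Distinct()
    .OrderBy(word => word);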

Finally, clean up the individual words or tokens.

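// Requires: using System.Text.RegularExpressions;
public string CleanWord(string word)
{
    //Remove everything but letters, numbers and whitespace characters
    word = Regex.Replace(word, @"[^\w\s]", string.Empty);

    //Remove any digits (hyphens are already gone after the previous step).
    //Digits are deleted rather than replaced with a space, so a single token
    //never ends up containing embedded spaces.
    word = Regex.Replace(word, @"\d", string.Empty);

    //Remove multiple whitespace characters
    word = Regex.Replace(word, @"\s+", " ");

    return word.Trim();
}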
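
A side note on the question's code: its last pattern, `\b\w{2,20}\b`, matches words of 2 to 20 characters, so replacing the matches with an empty string deletes exactly the words its comment says it wants to keep. Below is a minimal sketch of the inverted filter, assuming the intent is to keep words of 2 to 20 characters. The `LengthFilter` class and `RemoveOutliers` method are hypothetical names introduced for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class LengthFilter
{
    // Matches a whole word that is either a single character or at least 21 characters long.
    private static readonly Regex TooShortOrTooLong =
        new Regex(@"\b(?:\w|\w{21,})\b", RegexOptions.Compiled);

    public static string RemoveOutliers(string content)
    {
        // Replace each outlier with a space, then collapse the leftover runs of whitespace.
        var filtered = TooShortOrTooLong.Replace(content, " ");
        return Regex.Replace(filtered, @"\s+", " ").Trim();
    }
}
```

For example, `LengthFilter.RemoveOutliers("a quick fox")` returns `"quick fox"`. Filtering by `Length` in LINQ, as in the answer's code, is arguably easier to read than doing it inside a pattern.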

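Obviously, this is about the simplest implementation imaginable. It is very primitive: it won't work well for non-English languages, it splits only on the space character rather than on all whitespace, it doesn't handle punctuation well, and so on. Still, it should give you an idea of the individual pieces. You could look at something like Lucene.NET to improve your tokenization, and there are probably more libraries available if you want to improve the implementation.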
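
If a full library is more than you need, one way to sidestep the space-splitting and punctuation problems is to match runs of letters directly instead of splitting first and cleaning afterwards. The following is a minimal sketch and not part of the answer above. The `WordTokenizer` class and `Tokenize` method are hypothetical names, and the default bounds of 2 to 20 characters are an assumption borrowed from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordTokenizer
{
    // \p{L}+ matches a run of Unicode letters, so digits, punctuation,
    // underscores and every kind of whitespace all act as separators.
    private static readonly Regex LetterRun = new Regex(@"\p{L}+", RegexOptions.Compiled);

    public static IEnumerable<string> Tokenize(string text, int minLength = 2, int maxLength = 20)
    {
        return LetterRun.Matches(text)
            .Cast<Match>() // MatchCollection is not generic on older frameworks
            .Select(match => match.Value.ToLowerInvariant())
            .Where(word => word.Length >= minLength && word.Length <= maxLength)
            .Distinct()
            .OrderBy(word => word, StringComparer.Ordinal);
    }
}
```

Calling `WordTokenizer.Tokenize("Hello, world! hello-again 42")` yields `again`, `hello`, `world`. This is still crude: contractions such as "don't" are split at the apostrophe, and languages written without spaces between words need a real analyzer.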