Question

I am parsing HTML (held as a string in C# code) and need to extract all the text phrases from it. For example, given this HTML:

<div><div>text1</div>text2</div>

I want to get an array of strings:

text1
text2

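If this cannot be done with a regular expression, please suggest an algorithm that skips all tag names and tag attributes and returns only the text content.

Update: this is not a duplicate of the span question, because the text can be inside any tag, not just span. I need all the text, excluding tags and attributes. I do not want to use the HtmlAgility parser.

Update 2: found a regular expression (yes, it is possible)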

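    // Parse the HTML and collect each text node into the list.
    public void FindTextHtml(string html, List<string> list)
    {
        // Group 1 captures whatever sits between a '>' and the next '<'.
        var ms = Regex.Matches(html, @">([^<>]*)<", RegexOptions.IgnoreCase | RegexOptions.Multiline);
        foreach (Match m in ms)
        {
            var text = m.Groups[1].Value;
            // Adjacent tags such as "<div><div>" yield an empty capture; skip those.
            if (!string.IsNullOrWhiteSpace(text))
            {
                list.Add(text);
            }
        }
    }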

Full source code here
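For the non-regex alternative the question asks about, one possible approach is a small character scanner that tracks whether it is currently inside a tag. This is only a minimal sketch: `HtmlTextScanner` and `ExtractText` are hypothetical names, and it assumes simple markup (it does not handle comments, `<script>`/`<style>` bodies, entities, or a `>` inside an attribute value).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlTextScanner
{
    // Hypothetical helper (not from the original post): collects the text
    // found between tags by walking the string one character at a time.
    public static List<string> ExtractText(string html)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                // A tag starts: store any text gathered so far.
                Flush(current, result);
                insideTag = true;
            }
            else if (c == '>' && insideTag)
            {
                // The tag (name and attributes) ends here.
                insideTag = false;
            }
            else if (!insideTag)
            {
                current.Append(c);
            }
        }

        Flush(current, result); // trailing text after the last tag
        return result;
    }

    // Adds the trimmed buffer to the result if it is non-empty, then resets it.
    private static void Flush(StringBuilder current, List<string> result)
    {
        string text = current.ToString().Trim();
        if (text.Length > 0)
        {
            result.Add(text);
        }
        current.Clear();
    }
}
```

Calling `HtmlTextScanner.ExtractText("<div><div>text1</div>text2</div>")` yields a list containing `text1` and `text2`.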

Answer 1

What you are looking for is: Grabbing HTML Tags

The matches you are looking for will be in the ...(.*?)... group. Hope this helps.
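To show how a lazy `(.*?)` group is read back in C#, here is a rough sketch. The pattern is an assumption made for illustration (the linked article's exact expression is not reproduced in this thread), and `TagContentRegex` and `InnerContents` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagContentRegex
{
    // Hypothetical pattern for illustration only.
    // Group 1 = tag name, group 2 = lazily matched content (.*?),
    // \1 = back-reference to the same tag name in the closing tag.
    private static readonly Regex TagPair = new Regex(
        @"<(\w+)[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the content captured between each matched pair of tags.
    public static List<string> InnerContents(string html)
    {
        var contents = new List<string>();
        foreach (Match m in TagPair.Matches(html))
        {
            contents.Add(m.Groups[2].Value);
        }
        return contents;
    }
}
```

For flat markup such as `<b>bold</b> and <i>italic</i>` this returns `bold` and `italic`. It does not cope with nesting: on the question's example `<div><div>text1</div>text2</div>` the lazy group captures `<div>text1`, so this style of pattern only suits simple, non-nested input.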

Answer 2

Use the HtmlAgilityPack DLL to parse XML and HTML files, then use the code below to get the text:

    string path = @"path to the file";
    HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
    hd.Load(path);
    string result = hd.DocumentNode.InnerText.Trim();

That is all you need.
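Since the question holds the HTML in a string and wants each phrase separately, a variant along the following lines may be closer to what was asked. It is a sketch only: it assumes the HtmlAgilityPack NuGet package is referenced, and `HapTextNodes` and `ExtractText` are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // external NuGet package, assumed to be installed

public static class HapTextNodes
{
    // Hypothetical helper: parses an in-memory HTML string and returns each
    // non-empty text node separately instead of one concatenated InnerText.
    public static List<string> ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml takes markup; Load takes a path or stream

        var texts = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Keep text nodes only; elements and comments are skipped.
            if (node.NodeType != HtmlNodeType.Text) continue;

            // Decode entities such as &amp; and drop surrounding whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) texts.Add(text);
        }
        return texts;
    }
}
```

For the question's example, `HapTextNodes.ExtractText("<div><div>text1</div>text2</div>")` is expected to give `text1` and `text2` as separate entries, whereas `DocumentNode.InnerText` on the same input joins them into a single string.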
