C#: safely truncating HTML for an article summary

Time: 2009-11-11 12:00:55

Tags: c# html regex

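Does anyone have a C# variation of this?

This is so I can take some HTML and display it as a summary lead-in to an article without breaking the markup.

Truncate text containing HTML, ignoring tags

Save me from reinventing the wheel!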

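Edit

Sorry, I'm new here, and you're right: I should have phrased the question better. Here is some more info.

I want to take an HTML string and truncate it to a set number of words (or even a character length), so that I can show the start of it as a summary (which then leads to the main article). I want to preserve the HTML so that I can show links etc. in the preview.

The main issue I have to solve is that if we truncate in the middle of one or more tags, we may well end up with unclosed HTML tags!

My idea for a solution is to: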

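  1. First truncate the HTML to N words (words preferred, but chars are OK), making sure not to stop in the middle of a tag and cut off a required attribute.

  2. Work through the opened HTML tags in this truncated string (maybe pushing them onto a stack as I go?).

  3. Then work through the closing tags and make sure they match the ones on the stack as I pop them off.

  4. If any open tags are left on the stack after this, write their closing tags to the end of the truncated string, and the HTML should be good to go!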
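
The four steps above could be sketched roughly as follows. This is only a minimal illustration, not code from any of the answers: the class name `HtmlSummary`, the method `Truncate` and the short `VoidTags` list are all made up for the example. It counts visible characters rather than words, treats an entity such as `&amp;` as one character (using a crude "semicolon within 10 characters" guess), and assumes reasonably well-formed HTML without comments or script blocks.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlSummary
{
    // Hypothetical, deliberately incomplete list of elements that never get a closing tag.
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input" };

    public static string Truncate(string html, int maxChars)
    {
        var result = new StringBuilder();
        var open = new Stack<string>();   // names of tags opened but not yet closed
        int visible = 0;                  // characters counted outside of tags
        int i = 0;

        // Step 1: copy the input until enough visible characters were seen.
        // Tags are always copied whole, so we never stop inside one.
        while (i < html.Length && visible < maxChars)
        {
            char c = html[i];
            if (c == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break;       // unterminated tag: drop the remainder
                string tag = html.Substring(i, end - i + 1);
                string name = tag.Trim('<', '>', '/').Trim()
                                 .Split(' ', '\t', '\r', '\n', '/')[0];

                if (tag.StartsWith("</", StringComparison.Ordinal))
                {
                    // Step 3: a closing tag should match the most recently opened tag.
                    if (open.Count > 0 && open.Peek().Equals(name, StringComparison.OrdinalIgnoreCase))
                        open.Pop();
                }
                else if (!tag.EndsWith("/>", StringComparison.Ordinal) && !VoidTags.Contains(name))
                {
                    // Step 2: remember every opened tag.
                    open.Push(name);
                }
                result.Append(tag);
                i = end + 1;
            }
            else if (c == '&')
            {
                // Copy what looks like an entity as a single visible character.
                int semi = html.IndexOf(';', i);
                int len = (semi > i && semi - i <= 10) ? semi - i + 1 : 1;
                result.Append(html, i, len);
                i += len;
                visible++;
            }
            else
            {
                result.Append(c);
                i++;
                visible++;
            }
        }

        // Step 4: close whatever is still open, innermost first.
        while (open.Count > 0)
            result.Append("</").Append(open.Pop()).Append('>');

        return result.ToString();
    }
}
```

For instance, `HtmlSummary.Truncate("<h1>1234</h1><b><i>56789</i>012345</b>", 12)` should give `<h1>1234</h1><b><i>56789</i>012</b>`. Counting words instead of characters would only change the counting branch, not the stack handling.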

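Edit 12/11/2009

  • Here is what I have bumbled together so far as a unit test file in VS2008; this may help someone in the future.
  • My hack attempts based on Jan's code are at the top, as a char version and a word version (DISCLAIMER: this is rough, dirty code on my part!).
  • I assume well-formed HTML in all cases (but not necessarily a full document with a root node, as the XML version requires).
  • Abel's XML version is at the bottom, but I haven't yet got the tests fully running against it (I also still need to understand the code)...
  • I will update this when I get a chance to refine it.
  • Having trouble posting code? Is there no upload facility on Stack Overflow?

Thanks for all the comments :)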

    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;
    using System.Xml;
    using System.Xml.XPath;
    using Microsoft.VisualStudio.TestTools.UnitTesting;
    
    namespace PINET40TestProject
    {
        [TestClass]
        public class UtilityUnitTest
        {
            public static string TruncateHTMLSafeishChar(string text, int charCount)
            {
                bool inTag = false;
                int cntr = 0;
                int cntrContent = 0;
    
                // loop through html, counting only viewable content
                foreach (Char c in text)
                {
                    if (cntrContent == charCount) break;
                    cntr++;
                    if (c == '<')
                    {
                        inTag = true;
                        continue;
                    }
    
                    if (c == '>')
                    {
                        inTag = false;
                        continue;
                    }
                    if (!inTag) cntrContent++;
                }
    
                string substr = text.Substring(0, cntr);
    
                //search for nonclosed tags        
                MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
                MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);
    
                // create stack          
                Stack<string> opentagsStack = new Stack<string>();
                Stack<string> closedtagsStack = new Stack<string>();
    
                // to be honest, this seemed like a good idea then I got lost along the way 
                // so logic is probably hanging by a thread!! 
                foreach (Match tag in openedTags)
                {
                    string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                    // strip any attributes, sure we can use regex for this!
                    if (openedtag.IndexOf(" ") >= 0)
                    {
                        openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                    }
    
                    // ignore brs as self-closed
                    if (openedtag.Trim() != "br")
                    {
                        opentagsStack.Push(openedtag);
                    }
                }
    
                foreach (Match tag in closedTags)
                {
                    string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                    closedtagsStack.Push(closedtag);
                }
    
                if (closedtagsStack.Count < opentagsStack.Count)
                {
                    while (opentagsStack.Count > 0)
                    {
                        string tagstr = opentagsStack.Pop();
    
                        if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                        {
                            substr += "</" + tagstr + ">";
                        }
                        else
                        {
                            closedtagsStack.Pop();
                        }
                    }
                }
    
                return substr;
            }
    
            public static string TruncateHTMLSafeishWord(string text, int wordCount)
            {
                bool inTag = false;
                int cntr = 0;
                int cntrWords = 0;
                Char lastc = ' ';
    
                // loop through html, counting only viewable content
                foreach (Char c in text)
                {
                    if (cntrWords == wordCount) break;
                    cntr++;
                    if (c == '<')
                    {
                        inTag = true;
                        continue;
                    }
    
                    if (c == '>')
                    {
                        inTag = false;
                        continue;
                    }
                    if (!inTag)
                    {
                        // a space outside a tag ends a word; consecutive spaces count only once
                        if (c == ' ' && lastc != ' ')
                            cntrWords++;
                        lastc = c; // remember the previous visible character (was never updated before)
                    }
                }
    
                string substr = text.Substring(0, cntr) + " ...";
    
                //search for nonclosed tags        
                MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
                MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);
    
                // create stack          
                Stack<string> opentagsStack = new Stack<string>();
                Stack<string> closedtagsStack = new Stack<string>();
    
                foreach (Match tag in openedTags)
                {
                    string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                    // strip any attributes, sure we can use regex for this!
                    if (openedtag.IndexOf(" ") >= 0)
                    {
                        openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                    }
    
                    // ignore brs as self-closed
                    if (openedtag.Trim() != "br")
                    {
                        opentagsStack.Push(openedtag);
                    }
                }
    
                foreach (Match tag in closedTags)
                {
                    string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                    closedtagsStack.Push(closedtag);
                }
    
                if (closedtagsStack.Count < opentagsStack.Count)
                {
                    while (opentagsStack.Count > 0)
                    {
                        string tagstr = opentagsStack.Pop();
    
                        if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                        {
                            substr += "</" + tagstr + ">";
                        }
                        else
                        {
                            closedtagsStack.Pop();
                        }
                    }
                }
    
                return substr;
            }
    
            public static string TruncateHTMLSafeishCharXML(string text, int charCount)
            {
                // parse the input; LoadXml requires well-formed XML with a single root element
                XmlDocument xml = new XmlDocument();
                xml.LoadXml(text);
                // create a navigator, this is our primary tool
                XPathNavigator navigator = xml.CreateNavigator();
                XPathNavigator breakPoint = null;
    
                // find the text node we need:
                while (navigator.MoveToFollowing(XPathNodeType.Text))
                {
                    string lastText = navigator.Value.Substring(0, Math.Min(charCount, navigator.Value.Length));
                    charCount -= navigator.Value.Length;
                    if (charCount <= 0)
                    {
                        // truncate the last text. Here goes your "search word boundary" code:        
                        navigator.SetValue(lastText);
                        breakPoint = navigator.Clone();
                        break;
                    }
                }
    
                // first remove text nodes, because Microsoft unfortunately merges them without asking
                while (navigator.MoveToFollowing(XPathNodeType.Text))
                {
                    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                    {
                        navigator.DeleteSelf();
                    }
                }
    
                // then remove the remaining elements that follow the break point
                navigator.MoveTo(breakPoint);
                while (navigator.MoveToFollowing(XPathNodeType.Element))
                {
                    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                    {
                        navigator.DeleteSelf();
                    }
                }
    
                // then remove *all* empty nodes to clean up (not necessary):
                // TODO, add empty elements like <br />, <img /> as exclusion
                navigator.MoveToRoot();
                while (navigator.MoveToFollowing(XPathNodeType.Element))
                {
                    while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
                    {
                        navigator.DeleteSelf();
                    }
                }
    
                // return to the root so that InnerXml covers the whole document
                navigator.MoveToRoot();
                return navigator.InnerXml;
            }
    
            [TestMethod]
            public void TestTruncateHTMLSafeish()
            {
                // Case where we just make it to start of HREF (so effectively an empty link)
    
                // 'simple' nested none attributed tags
                Assert.AreEqual(@"<h1>1234</h1><b><i>56789</i>012</b>",
                TruncateHTMLSafeishChar(
                    @"<h1>1234</h1><b><i>56789</i>012345</b>",
                    12));
    
                // In middle of a!
                Assert.AreEqual(@"<h1>1234</h1><a href=""testurl""><b>567</b></a>",
                TruncateHTMLSafeishChar(
                    @"<h1>1234</h1><a href=""testurl""><b>5678</b></a><i><strong>some italic nested in string</strong></i>",
                    7));
    
                // more
                Assert.AreEqual(@"<div><b><i><strong>1</strong></i></b></div>",
                TruncateHTMLSafeishChar(
                    @"<div><b><i><strong>12</strong></i></b></div>",
                    1));
    
                // br
                Assert.AreEqual(@"<h1>1 3 5</h1><br />6",
                TruncateHTMLSafeishChar(
                    @"<h1>1 3 5</h1><br />678<br />",
                    6));
            }
    
            [TestMethod]
            public void TestTruncateHTMLSafeishWord()
            {
                // zero case
                Assert.AreEqual(@" ...",
                                TruncateHTMLSafeishWord(
                                    @"",
                                   5));
    
                // 'simple' nested none attributed tags
                Assert.AreEqual(@"<h1>one two <br /></h1><b><i>three  ...</i></b>",
                TruncateHTMLSafeishWord(
                    @"<h1>one two <br /></h1><b><i>three </i>four</b>",
                    3), "we have added ' ...' to end of summary");
    
                // In middle of a!
                Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
                TruncateHTMLSafeishWord(
                    @"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i>",
                    4));
    
                // start of h1
                Assert.AreEqual(@"<h1>one two three  ...</h1>",
                TruncateHTMLSafeishWord(
                    @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                    3));
    
                // more than words available
                Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
                TruncateHTMLSafeishWord(
                    @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                    99));
            }
    
            [TestMethod]
            public void TestTruncateHTMLSafeishWordXML()
            {
                // zero case
                Assert.AreEqual(@" ...",
                                TruncateHTMLSafeishWord(
                                    @"",
                                   5));
    
                // 'simple' nested none attributed tags
                string output = TruncateHTMLSafeishCharXML(
                    @"<body><h1>one two </h1><b><i>three </i>four</b></body>",
                    13);
                // regular (non-verbatim) string, so that \r\n really means a line break
                Assert.AreEqual("<body>\r\n  <h1>one two </h1>\r\n  <b>\r\n    <i>three</i>\r\n  </b>\r\n</body>", output,
                 "XML version: no ' ...' yet, and it adds line breaks plus indentation when formatting the document");
    
                // In middle of a!
                Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
                TruncateHTMLSafeishCharXML(
                    @"<body><h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i></body>",
                    4));
    
                // start of h1
                Assert.AreEqual(@"<h1>one two three  ...</h1>",
                TruncateHTMLSafeishCharXML(
                    @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                    3));
    
                // more than words available
                Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
                TruncateHTMLSafeishCharXML(
                    @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                    99));
            }
        }
    }
    

4 answers:

Answer 0 (score: 11)

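EDIT: see below for a full solution. The first attempt strips the HTML, the second does not.

Let's summarize what you want:

  • No HTML in the result
  • It should take any valid data inside <body>
  • It has a fixed maximum length

If your HTML is XHTML, this becomes trivial (I haven't looked at the PHP solution and doubt very much that it uses a similar approach, but I believe this one is understandable and rather easy):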

XmlDocument xml = new XmlDocument();

// replace the following line with the content of your full XHTML
xml.LoadXml(@"<body><p>some <i>text</i>here</p><div>that needs stripping</div></body>");

// Get all textnodes under <body> (twice "//" is on purpose)
XmlNodeList nodes = xml.SelectNodes("//body//text()");

// loop through the text nodes, replace this with whatever you like to do with the text
foreach (XmlNode node in nodes)
{
    Debug.WriteLine(node.Value);   // Debug lives in System.Diagnostics
}

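Note: whitespace etc. will be preserved. This is usually a good thing.

If you don't have XHTML, you can use the HTML Agility Pack, which lets you do about the same for plain old HTML (internally it converts it to some DOM). I haven't tried it, but it should work rather smoothly.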


BIG EDIT:

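Actual solution

In a small comment I promised to take the XHTML / XmlDocument approach and use it for a typesafe way of splitting your HTML based on text length while keeping the HTML markup. I used the following HTML; the code breaks it correctly in the middle of "needs", removes the rest, removes empty nodes and automatically closes any open elements.

The sample HTML: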

<body>
    <p><tt>some<u><i>text</i>here</u></tt></p>
    <div>that <b><i>needs <span>str</span>ip</i></b><s>ping</s></div>
</body>

The code, tested and working with any kind of input (OK, granted, I only did a few tests and the code may contain bugs; let me know if you find any):

// your data, probably comes from somewhere, or as params to a method
int lengthAvailable = 20;
XmlDocument xml = new XmlDocument();
xml.LoadXml(@"place-html-code-here-left-out-for-brevity");

// create a navigator, this is our primary tool
XPathNavigator navigator = xml.CreateNavigator();
XPathNavigator breakPoint = null;


string lastText = "";

// find the text node we need:
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
    lastText = navigator.Value.Substring(0, Math.Min(lengthAvailable, navigator.Value.Length));
    lengthAvailable -= navigator.Value.Length;

    if (lengthAvailable <= 0)
    {
        // truncate the last text. Here goes your "search word boundary" code:
        navigator.SetValue(lastText);
        breakPoint = navigator.Clone();
        break;
    }
}

// first remove text nodes, because Microsoft unfortunately merges them without asking
while (navigator.MoveToFollowing(XPathNodeType.Text))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then move the rest
navigator.MoveTo(breakPoint);
while (navigator.MoveToFollowing(XPathNodeType.Element))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then remove *all* empty nodes to clean up (not necessary): 
// TODO, add empty elements like <br />, <img /> as exclusion
navigator.MoveToRoot();
while (navigator.MoveToFollowing(XPathNodeType.Element))
    while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
        navigator.DeleteSelf();  // moves to parent

navigator.MoveToRoot();
Debug.WriteLine(navigator.InnerXml);

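How the code works

The code does the following things, in this order:

  1. It goes through all text nodes until the text size grows beyond the allowed limit, in which case it truncates that node. This automatically treats &gt; etc. as one character.
  2. It then shortens the text of the "breaking node" and writes it back. It clones the XPathNavigator at this point, because we need to remember this "breaking point".
  3. To work around an MS bug (an ancient one, actually), we first have to remove any remaining text nodes that follow the breaking point; otherwise we risk text nodes being merged automatically when they end up as siblings of each other. Note: DeleteSelf is handy, but it moves the navigator to its parent, which is why we need to check the current position against the "breaking point" remembered in the previous step.
  4. Then we do what we wanted to do in the first place: remove every node that follows the breaking point.
  5. Not a necessary step: clean up and remove any empty elements. This is merely to tidy the HTML and/or to filter for specific (dis)allowed elements. It can be left out.
  6. Go back to the root and get the content as a string with InnerXml.

That's all. It is rather simple, though it may look a bit daunting at first.

PS: the same would be much easier to read and understand with XSLT, which is the ideal tool for this kind of job.

Update: added an extended code example, based on the edited question. Update: added a bit of explanation.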
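
To illustrate the point in step 1 about entities: once the markup is parsed, a text node holds the decoded characters, so an entity counts as one character and cannot be cut in half. Below is a small sketch of that behaviour; the class and method names (`EntityDemo`, `TruncateFirstText`) are invented for the example and are not part of the answer's code.

```csharp
using System;
using System.Xml;

public static class EntityDemo
{
    // Truncates the first text node of a small XML snippet and returns the
    // re-serialized markup. Works on decoded text, not on the raw source.
    public static string TruncateFirstText(string xml, int maxChars)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNode text = doc.SelectSingleNode("//text()");
        if (text != null && text.Value.Length > maxChars)
        {
            // Value is the decoded text, e.g. "a & b" for "a &amp; b"
            text.Value = text.Value.Substring(0, maxChars);
        }

        // Serializing escapes the ampersand again
        return doc.OuterXml;
    }
}
```

With the input `<p>a &amp; b</p>` and a limit of 3, the text node holds `a & b`, the first three characters are `a &`, and the serialized result should be `<p>a &amp;</p>`.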

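Answer 1 (score: 4)

If you want to keep the HTML tags, you can use this gist that I published recently: https://gist.github.com/2413598

It uses XmlReader / XmlWriter. It is not production ready, i.e. you would probably want SgmlReader or HtmlAgilityPack, and you would want try/catch blocks and some kind of fallback...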
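
For readers who cannot open the gist, here is a rough sketch of what an XmlReader / XmlWriter based truncation could look like. It is not the gist's code: the names `StreamingTruncator` and `Truncate` are invented, and it assumes well-formed XHTML with a single root element and only the predefined XML entities (something like `&nbsp;` would make the reader throw).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class StreamingTruncator
{
    public static string Truncate(string xhtml, int maxChars)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        int remaining = maxChars; // text characters we may still copy
        int depth = 0;            // elements written but not yet closed

        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            while (remaining > 0 && reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // read this before WriteAttributes moves the reader around
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (isEmpty) writer.WriteEndElement();   // e.g. <br />
                        else depth++;
                        break;
                    }
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        depth--;
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                    {
                        // reader.Value is already entity-decoded
                        string value = reader.Value;
                        int take = Math.Min(remaining, value.Length);
                        writer.WriteString(value.Substring(0, take));
                        remaining -= take;
                        break;
                    }
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);    // copied, not counted
                        break;
                }
            }

            // close everything that was still open when the budget ran out
            for (; depth > 0; depth--)
                writer.WriteFullEndElement();
        }

        return output.ToString();
    }
}
```

Called as `StreamingTruncator.Truncate("<p>Hello <b>brave new</b> world</p>", 9)`, this should return `<p>Hello <b>bra</b></p>`. The depth counter makes the closing of open elements explicit instead of relying on what the writer does when it is disposed.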

Answer 2 (score: 2)

OK. This should work (dirty code alert):

        string blah = "hoi <strong>dit <em>is test bla meer tekst</em></strong>";
        int aantalChars = 10;


        bool inTag = false;
        int cntr = 0;
        int cntrContent = 0;
        foreach (Char c in blah)
        {
            if (cntrContent == aantalChars) break;



            cntr++;
            if (c == '<')
            {
                inTag = true;
                continue;
            }
            else if (c == '>')
            {
                inTag = false;
                continue;
            }

            if (!inTag) cntrContent++;
        }

        string substr = blah.Substring(0, cntr);

        //search for nonclosed tags
        MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
        MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

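        // close the tags that are still open, innermost first
        // (heuristic: assumes the unclosed tags are the ones opened last)
        for (int i = openedTags.Count - closedTags.Count; i >= 1; i--)
        {
            // keep only the tag name, so attributes are not copied into the closing tag
            string tagName = openedTags[closedTags.Count + i - 1].Value.Trim('<', '>').Split(' ')[0];
            substr += "</" + tagName + ">";
        }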

Answer 3 (score: 0)

This is complicated, and as far as I can see none of the PHP solutions is perfect. What if the text is:

substr("Hello, my <strong>name is <em>Sam</em>. I&acute;m a 
  web developer.  And this text is very long and all the text 
  is inside the sam html tag..</strong>",0,26)."..."

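You would actually have to go through the whole text to find the end of the opening strong tag.

My advice is to strip all HTML from the summary. And if you are showing your own HTML code to users, remember to use HTML sanitizing.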
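
A minimal sketch of that "strip everything" route, assuming the input is well-formed XHTML with a single root element. The names `PlainSummary` and `Create` are made up for the example; for real-world HTML a tolerant parser would be needed instead of `XmlDocument`.

```csharp
using System;
using System.Net;
using System.Xml;

public static class PlainSummary
{
    public static string Create(string xhtml, int maxChars)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;   // keep spaces that sit between inline elements
        doc.LoadXml(xhtml);

        // InnerText concatenates all text nodes: tags are gone, entities are decoded
        string text = doc.DocumentElement.InnerText;
        if (text.Length > maxChars)
            text = text.Substring(0, maxChars) + "...";

        // encode again so the summary is safe to place inside an HTML page
        return WebUtility.HtmlEncode(text);
    }
}
```

Since no tags survive, nothing can be left unclosed, at the price of losing links and formatting in the preview.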

Good luck :)