Read the first 3 paragraphs of a long string. [C#, HTML Agility Pack]

Date: 2010-07-23 10:56:50

Tags: c# asp.net

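I want to read and output the first 3 paragraphs of a long string. How can I do that? I was using the code below to show the first (n) words, but I have since switched to paragraphs.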

public string MySummary(string html, int max)
{
    string summaryHtml = string.Empty;

    // load our html document
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    int wordCount = 0;

    foreach (var element in htmlDoc.DocumentNode.ChildNodes)
    {
        // inner text will strip out all html, and give us plain text
        string elementText = element.InnerText;

        // we split by space to get all the words in this element
        string[] elementWords = elementText.Split(new char[] { ' ' });

        // and if we haven't used too many words ...

        if (wordCount <= max)
        {
            // add the *outer* HTML (which will have proper 
            // html formatting for this fragment) to the summary
            summaryHtml += element.OuterHtml;
            wordCount += elementWords.Count() + 1;

        }
        else
        {
            break;
        }
    }

    return summaryHtml;
}

5 Answers:

Answer 0 (score: 2)

If your paragraphs are <p> tags, why not select all the <p> nodes in the document and pull the inner text from the first 3?

Edit in response to the comment:

RTFM?

http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home

Something like:

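// The separator is the first argument of string.Join; note that SelectNodes returns null when nothing matches.
string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(3).Select(n => n.InnerText).ToArray());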

Answer 1 (score: 0)

Answer 2 (score: 0)

I had to do this myself and came up with a very simple but forgiving approach that works well for our particular situation:

    public string GetParagraphs(string html, int numberOfParagraphs)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
    }

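I realize this is naive about the structure of the document, and it will also pick up any non-<p> tags that sit between the <p> elements, but in my use case that is actually what I want. Maybe it will work for you too?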

Answer 3 (score: 0)

This is the better answer. But what if we want paragraphs 2 through 5? How would that be coded?

public string GetParagraphs(string html, int numberOfParagraphs) {
    const string paragraphSeparator = "</p>";
    var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
}
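
One possible way to take a range such as paragraphs 2 through 5 is to put a Skip in front of the Take. This is only a sketch built on the same naive "</p>" split as above, so it carries the same caveats about document structure. The class name ParagraphSlicer, the method name GetParagraphRange, and the 1-based inclusive first/last parameters are my own assumptions, not part of the original answer.

```csharp
using System;
using System.Linq;

public static class ParagraphSlicer
{
    // Hypothetical helper: returns paragraphs first..last (1-based, inclusive),
    // using the same naive split on "</p>" as GetParagraphs above.
    public static string GetParagraphRange(string html, int first, int last)
    {
        if (first < 1)
            throw new ArgumentOutOfRangeException(nameof(first));
        if (last < first)
            throw new ArgumentOutOfRangeException(nameof(last));

        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);

        // Skip everything before the first wanted paragraph, then take the
        // wanted count and restore the closing tag that Split removed.
        return string.Join("", paragraphs
            .Skip(first - 1)
            .Take(last - first + 1)
            .Select(paragraph => paragraph + paragraphSeparator));
    }
}
```

Calling ParagraphSlicer.GetParagraphRange(html, 2, 5) would then return paragraphs 2 to 5. If the document has fewer paragraphs than requested, Skip and Take simply return what is available.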

Answer 4 (score: 0)

You have to use HtmlAgilityPack.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HtmlContent);

string Html = string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(2).Select(n => n.OuterHtml).ToArray());
