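I want to read and output the first 3 paragraphs of a long string. How can I do that? I was using the code below to show the first n words, but I have since decided I want paragraphs instead.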
    public string MySummary(string html, int max)
    {
        string summaryHtml = string.Empty;
        // load our html document
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        int wordCount = 0;
        foreach (var element in htmlDoc.DocumentNode.ChildNodes)
        {
            // inner text will strip out all html, and give us plain text
            string elementText = element.InnerText;
            // we split by space to get all the words in this element
            string[] elementWords = elementText.Split(new char[] { ' ' });
            // and if we haven't used too many words ...
            if (wordCount <= max)
            {
                // add the *outer* HTML (which will have proper
                // html formatting for this fragment) to the summary
                summaryHtml += element.OuterHtml;
                wordCount += elementWords.Count() + 1;
            }
            else
            {
                break;
            }
        }
        return summaryHtml;
    }
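Answer 0 (score: 2)
If your paragraphs are <p> tags, why not get all the <p> nodes of the document and pull the inner text of the first 3?
Edit in response to the comment:
RTFM?
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
Something like: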
    string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(3).Select(n => n.InnerText).ToArray());
Answer 2 (score: 0)
I had to do this myself and came up with a very simple but forgiving approach that works well for our particular case:
    public string GetParagraphs(string html, int numberOfParagraphs)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
    }
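I realize how naive this is about the structure of the document, and that it will also pick up any non-<p> tags that sit between the <p> elements, but in my use case that is actually what I want. Maybe it will work for you too?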
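To make that caveat concrete, here is a small sketch that runs the same split on a made-up input. The `ParagraphDemo` class name and the sample HTML are assumptions for illustration only; the method body is the one from this answer.

```csharp
using System;
using System.Linq;

public static class ParagraphDemo
{
    // Same naive approach as above: split on the closing tag, then re-append it.
    public static string GetParagraphs(string html, int numberOfParagraphs)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs.Take(numberOfParagraphs).Select(p => p + paragraphSeparator));
    }

    public static void Main()
    {
        // Hypothetical input: a <div> sits between the first and second paragraph.
        const string html = "<p>One</p><div>note</div><p>Two</p><p>Three</p>";

        // Prints: <p>One</p><div>note</div><p>Two</p>
        Console.WriteLine(GetParagraphs(html, 2));
    }
}
```

The `<div>` travels with the second chunk because the split only looks at `</p>`. Two further things to keep in mind: the match on `</p>` is case-sensitive, and any text after the last `</p>` gets a closing tag appended if it falls within the requested count.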
Answer 3 (score: 0)
This is the better answer. But if we want to take paragraphs 2 through 5, how would that be coded?
    public string GetParagraphs(string html, int numberOfParagraphs)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
    }
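One way to get a range such as paragraphs 2 through 5 with the same split is to put `Skip` in front of `Take`. This is only a sketch: the `ParagraphRange` class, the `GetParagraphRange` name and its 1-based, inclusive `first`/`last` parameters are assumptions, not something taken from the answers in this thread.

```csharp
using System;
using System.Linq;

public static class ParagraphRange
{
    // Returns paragraphs first..last (1-based, inclusive) using the naive "</p>" split.
    public static string GetParagraphRange(string html, int first, int last)
    {
        if (first < 1 || last < first)
            throw new ArgumentOutOfRangeException(nameof(first), "Expected 1 <= first <= last.");

        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);

        return string.Join("", paragraphs
            .Skip(first - 1)         // drop everything before the first wanted paragraph
            .Take(last - first + 1)  // keep at most the requested number of paragraphs
            .Select(p => p + paragraphSeparator));
    }
}
```

Calling `ParagraphRange.GetParagraphRange(html, 2, 5)` would then return the second through fifth chunks. If the input has fewer paragraphs than `last`, it returns whatever is available, and it inherits all the limitations of the split approach.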
Answer 4 (score: 0)
You have to use HtmlAgilityPack.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(HtmlContent);
    // Outer HTML of the first two <p> elements, joined by a space.
    string Html = string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(2).Select(n => n.OuterHtml).ToArray());
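In HtmlAgilityPack, `SelectNodes` returns `null` rather than an empty collection when the XPath matches nothing, so the one-liner above throws on input that has no `<p>` tags. A guarded sketch, assuming the HtmlAgilityPack NuGet package is referenced; the `ParagraphExtractor` class and `FirstParagraphs` method names are made up for illustration:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ParagraphExtractor
{
    // Returns the outer HTML of the first `count` <p> elements, joined by a space.
    public static string FirstParagraphs(string html, int count)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//p" selects every <p> element in document order, including nested ones.
        var nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            // No <p> elements at all: return an empty summary instead of throwing.
            return string.Empty;
        }

        return string.Join(" ", nodes.Take(count).Select(n => n.OuterHtml));
    }
}
```

Swapping `n.OuterHtml` for `n.InnerText` gives plain text instead of markup, which is the variant the first answer describes.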