Question

I imagine it would look something like this (pseudocode):

var pars = new List<string>();
string par;
while (not eof("Platypus.html"))
{
    par = getNextParagraph();
    pars.Add(par);
}

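...where getNextParagraph() looks for the next "<p>" and continues until it finds "</p>", burning its bridges behind it ("cutting" the paragraph out so that it is not found over and over again). Or something like that.

Does anyone know how to do this, or a better way?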

Update

I tried to use Aurelien Souchet's code.

I have the following usings:

using HtmlAgilityPack;
using HtmlDocument = System.Windows.Forms.HtmlDocument;

...but this code:

HtmlDocument doc = new HtmlDocument();

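is rejected ("Cannot access private constructor 'HtmlDocument' here").

Also, both "doc.LoadHtml()" and "doc.DocumentNode" give the old "Cannot resolve symbol 'Bla'" error message.

Update 2

OK, I had to prepend "HtmlAgilityPack." to the type name, so that the ambiguous reference is disambiguated.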
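
For anyone hitting the same error, here is a minimal sketch of that fix. It assumes the project references both HtmlAgilityPack and System.Windows.Forms; the class name AmbiguityDemo and the sample HTML string are made up for illustration.

```csharp
using System;

// Hypothetical demo class, not part of the original code.
class AmbiguityDemo
{
    static void Main()
    {
        // Fully qualifying the type makes the compiler pick the
        // HtmlAgilityPack class, which has a public constructor,
        // rather than System.Windows.Forms.HtmlDocument, which does not.
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

        // LoadHtml and DocumentNode now resolve, because they are
        // members of HtmlAgilityPack.HtmlDocument.
        doc.LoadHtml("<p>Hello</p>");
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Pointing the alias at the other type instead (using HtmlDocument = HtmlAgilityPack.HtmlDocument;) should work just as well if the Windows Forms class is not needed in that file.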

Answer 1

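As people said in the comments, I think HtmlAgilityPack is the best choice: it is easy to use, and good examples and tutorials are easy to find.

Here is what I would write: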

// Don't forget to add a reference to HtmlAgilityPack
using System.Collections.Generic;
using HtmlAgilityPack;

// Takes the HTML as a string and returns a list of strings
// containing the content of each paragraph.
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
   var pars = new List<string>();

   // First create an HtmlDocument
   HtmlDocument doc = new HtmlDocument();

   // Load the HTML (from a string)
   doc.LoadHtml(sourceHtml);

   // Select all the <p> nodes into an HtmlNodeCollection
   HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes(".//p");

   // SelectNodes returns null when nothing matches
   if (paragraphs == null)
   {
      return pars;
   }

   // Iterate over every node in the collection
   foreach (HtmlNode paragraph in paragraphs)
   {
      // Add the InnerText to the list
      // (or paragraph.InnerHtml, depending on what you want)
      pars.Add(paragraph.InnerText);
   }

   return pars;
}

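This is just a basic example. Your HTML may contain nested paragraphs, in which case this code may not work as expected. It all depends on the HTML you are parsing and what you want to do with it.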
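
To tie this back to the file from the question, here is a rough sketch that loads the HTML straight from disk and cleans up the text a little. The class name ParagraphReader and the method name ReadParagraphs are invented for illustration. Decoding entities and skipping empty paragraphs are my own assumptions about what you might want, not something the question asks for.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, named for illustration only.
public static class ParagraphReader
{
    // Reads the HTML file at the given path and returns the text
    // of each non-empty <p> element, in document order.
    public static List<string> ReadParagraphs(string path)
    {
        var result = new List<string>();

        // Load parses the file directly; no need to read it into a string first.
        var doc = new HtmlDocument();
        doc.Load(path);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode p in nodes)
        {
            // Assumption: decode entities such as &amp; and trim whitespace.
            string text = HtmlEntity.DeEntitize(p.InnerText).Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }

        return result;
    }
}
```

You would call it with something like ParagraphReader.ReadParagraphs("Platypus.html") and loop over the returned list.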

Hope it helps!

How do I read an HTML file one paragraph at a time?
