How to stop at the first occurrence of a string that appears multiple times

Asked: 2015-05-18 09:00:32

Tags: c# .net regex

I'm currently writing a script to parse some content out of an HTML document.

Here is a sample of the markup I am parsing:

<div class="tab-content">
<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">
<h3>What is Pantoprazole?</h3>
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is
a condition where the acid in the stomach washes back up into the esophagus. <br/> Pantoprazole is a proton pump
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach.
<h3>How To Take</h3>
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water
</div>
</div>
<div class="tab-pane fade" id="alternative-treatments">
<div class="panel-body">
<h3>Alternatives</h3>
Antacids taken as required Antacids are alkali liquids or tablets
that can neutralise the stomach acid. A dose may give quick relief.
There are many brands which you can buy. You can also get some on
prescription. If you have mild or infrequent bouts of dyspepsia you
may find that antacids used as required are all that you need.<br/>
</div>
</div>
<div class="tab-pane fade" id="side-effects">
<div class="panel-body">
<p>Most people who take acid reflux medication do not have any side-effects.
However, side-effects occur in a small number of users. The most
common side-effects are:</p>
<ul>

I am trying to capture everything between:

<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">

</div>

I wrote the following regular expression:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n(?:<\/div>)

I also tried:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n<\/div>

But it doesn't seem to stop at the first </div>; it keeps matching until the final </div> in the document.

2 Answers:

Answer 0 (score: 3)

Don't use regex to parse HTML. You can use HtmlAgilityPack instead.

Then it works as expected:

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(File.ReadAllText("Path"));
var divPanelBody = doc.DocumentNode.SelectSingleNode("//div[@class='panel-body']");
string text = divPanelBody.InnerText.Trim();  // null check omitted

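Result:

What is Pantoprazole? Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is a condition where the acid in the stomach washes back up into the esophagus. Pantoprazole is a proton pump inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach. How To Take Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water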

Here is another approach using LINQ, which I prefer over the XPath syntax:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "") == "panel-body");

Note that both approaches are case-sensitive, so they won't find Panel-Body. You can easily make the last approach case-insensitive:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "").Equals("panel-body", StringComparison.InvariantCultureIgnoreCase));
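As for why the pattern in the question runs past the first </div>: in (.*?[\s\S]+) the .*? part is lazy, but the [\s\S]+ that follows is greedy, so it consumes the rest of the input and backtracks only as far as the last \n</div>. A minimal sketch of a lazy alternative is below. The class name, the helper ExtractFirstPanel and the shortened sample markup are made up for illustration, and this still breaks as soon as the panel contains a nested div, which is one reason a parser is the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatchDemo
{
    // Hypothetical helper: returns the text between the how-to-take panel's
    // opening tags and the first closing </div> that follows them.
    public static string ExtractFirstPanel(string html)
    {
        // [\s\S]*? is lazy: it grows one character at a time and stops at the
        // first position where the rest of the pattern (\s*</div>) can match.
        string pattern =
            "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\\s*" +
            "<div class=\"panel-body\">\\s*([\\s\\S]*?)\\s*</div>";
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Shortened, assumed sample in the same shape as the question's markup.
        string html =
            "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\n" +
            "<div class=\"panel-body\">\n" +
            "<h3>How To Take</h3>\n" +
            "Take the tablets 1 hour before a meal\n" +
            "</div>\n" +
            "</div>\n" +
            "<div class=\"tab-pane fade\" id=\"alternative-treatments\">\n" +
            "<div class=\"panel-body\">\n" +
            "<h3>Alternatives</h3>\n" +
            "</div>\n" +
            "</div>\n";

        Console.WriteLine(ExtractFirstPanel(html));
    }
}
```

Using \s* instead of a literal \n between the tags also makes the pattern tolerate \r\n line endings and indentation.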

Answer 1 (score: 0)

You can do this easily with HtmlAgilityPack:

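using System.Text;
using HtmlAgilityPack;

public string GetInnerHtml(string html)
{
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      // SelectNodes returns null (not an empty collection) when nothing matches.
      var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"panel-body\"]");
      if (nodes == null)
      {
            return string.Empty;
      }
      StringBuilder sb = new StringBuilder();
      // Concatenates the inner HTML of every panel-body div in the document.
      foreach (var n in nodes)
      {
            sb.Append(n.InnerHtml);
      }
      return sb.ToString();
}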
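
Note that the XPath above matches every panel-body div, so the result also includes the other tabs. If only the how-to-take panel is wanted, as in the question, the query can be scoped by id. The following is a rough sketch under that assumption: the class name, the helper GetHowToTakeText and the shortened sample markup are invented for illustration, and it needs the HtmlAgilityPack NuGet package.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HowToTakeExtractor
{
    // Hypothetical helper: returns the trimmed text of the panel-body div that
    // sits directly under the element with id="how-to-take", or null if absent.
    public static string GetHowToTakeText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Scoping by id keeps the other tabs' panel-body divs out of the result.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//div[@id='how-to-take']/div[@class='panel-body']");
        // SelectSingleNode returns null when nothing matches.
        return node?.InnerText.Trim();
    }

    public static void Main()
    {
        // Shortened, assumed sample in the same shape as the question's markup.
        string html =
            "<div class=\"tab-content\">" +
            "<div class=\"tab-pane fade in active\" id=\"how-to-take\">" +
            "<div class=\"panel-body\"><h3>How To Take</h3>Take the tablets</div>" +
            "</div>" +
            "<div class=\"tab-pane fade\" id=\"alternative-treatments\">" +
            "<div class=\"panel-body\"><h3>Alternatives</h3>Antacids</div>" +
            "</div>" +
            "</div>";

        Console.WriteLine(GetHowToTakeText(html));
    }
}
```

Because the parser tracks nesting, this keeps working if the panel later gains inner div elements, which a "first </div>" regex would not.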