Output the first two paragraphs from HTML stored as a string

Time: 2010-01-13 17:26:21

Tags: c# .net html string substring

My HTML is stored in a string variable in my C# .NET 2.0 code. Here is an example:

<div class="track">
    <img alt="" src="http://hits.guardian.co.uk/b/ss/guardiangu-feeds/1/H.20.3/30561?ns=guardian&pageName=Hundreds+feared+dead+in+Haiti+quake%3AArticle%3A1336252&ch=World+news&c3=GU.co.uk&c4=Haiti+%28News%29%2CDominican+Republic+%28News%29%2CCuba+%28News%29%2CBahamas+%28News%29%2CNatural+disasters+and+extreme+weather+%28News%29%2CEnvironment%2CWorld+news&c6=Rory+Carroll%2CHaroon+Siddique&c7=10-Jan-13&c8=1336252&c9=Article&c10=News&c11=World+news&c13=&c25=&c30=content&h2=GU%2FWorld+news%2FHaiti" width="1" height="1" />
</div>
<p class="standfirst">
    • Tens of thousands lose homes in 7.0 magnitude quake<br />
    • UN headquarters, schools and hospitals collapse
</p>
<p>
    René Préval, the president of Haiti, has described the devastation after last night's earthquake as "unimaginable" as governments and aid agencies around the world rushed into action.
</p>
<p>
    Préval described how he had been forced to step over dead bodies and heard the cries of those trapped under the rubble of the national parliament. "Parliament has collapsed. The tax office has collapsed. Schools have collapsed. Hospitals have collapsed," <a href="http://www.miamiherald.com/582/story/1422279.html" title="he told the Miami Herald">he told the Miami Herald</a>. "There are a lot of schools that have a lot of dead people in them." Préval said he thought thousands of people had died in the quake.
</p>

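I only want to output the first two paragraphs, as a substring of the original.

Can anyone help?

4 answers:

Answer 0: (score: 4)

I ended up using this function...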

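    // Requires: using System.Text.RegularExpressions;
    // Returns the content of the first attribute-less <p> element,
    // or the whole input when no such element is found.
    private string GetFirstParagraph(string htmltext)
    {
        // Singleline lets '.' span line breaks, so multi-line paragraphs match too.
        Match m = Regex.Match(htmltext, @"<p>\s*(.+?)\s*</p>", RegexOptions.Singleline);
        if (m.Success)
        {
            return m.Groups[1].Value;
        }
        else
        {
            return htmltext;
        }
    }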
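
The function above returns only the first attribute-less `<p>`, while the question asks for the first two paragraphs and the sample also contains `<p class="standfirst">`. Here is a minimal sketch of one way to extend the same regex idea so that it returns a true substring of the input. `ParagraphExtractor` and `GetFirstParagraphs` are hypothetical names, and treating the standfirst block as a paragraph is an assumption about what the asker wants. Regexes remain fragile on malformed or nested HTML.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphExtractor
{
    // Matches <p> or <p ...attributes...> up to the nearest </p>.
    // \b keeps tags such as <pre> or <param> from matching.
    private static readonly Regex ParagraphPattern = new Regex(
        @"<p\b[^>]*>.*?</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the slice of html running from the start of the first
    // paragraph to the end of the count-th one (or the last one found).
    public static string GetFirstParagraphs(string html, int count)
    {
        if (count <= 0)
        {
            return string.Empty;
        }

        MatchCollection matches = ParagraphPattern.Matches(html);
        if (matches.Count == 0)
        {
            // Same fallback as the original function: hand back the input.
            return html;
        }

        Match first = matches[0];
        Match last = matches[Math.Min(count, matches.Count) - 1];
        int end = last.Index + last.Length;
        return html.Substring(first.Index, end - first.Index);
    }
}
```

Calling `ParagraphExtractor.GetFirstParagraphs(html, 2)` keeps whatever whitespace or markup sits between the two paragraphs, because the result is cut directly from the input rather than rebuilt.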

Answer 1: (score: 3)

Take a look at the Html Agility Pack.

It exposes a very powerful API for parsing HTML, which you can use to extract the data you need.
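
As a rough sketch of what that might look like, assuming the HtmlAgilityPack assembly is referenced: the code below loads the string, selects every `<p>` element with XPath, and concatenates the markup of the first few. `AgilityPackExtractor` and `FirstParagraphs` are hypothetical names, and returning the input unchanged when no paragraph is found is an assumed fallback.

```csharp
using System.Text;
using HtmlAgilityPack;

public static class AgilityPackExtractor
{
    // Returns the outer HTML of the first `count` <p> elements, concatenated.
    public static string FirstParagraphs(string html, int count)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return html; // assumed fallback: hand back the input unchanged
        }

        StringBuilder result = new StringBuilder();
        for (int i = 0; i < paragraphs.Count && i < count; i++)
        {
            result.Append(paragraphs[i].OuterHtml);
        }
        return result.ToString();
    }
}
```

Using `InnerText` instead of `OuterHtml` would give the paragraph text with the tags stripped, if plain text is what is wanted.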

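Answer 2: (score: 0)

Are you using JavaScript at all? You could use explode on the p tag to get an array in which the first element holds the div plus the first paragraph, and each remaining p tag sits in its own element.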
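
Since the question is about C#, the closest analogue of the splitting idea is `String.Split` (`explode` itself is a PHP function). Here is a minimal sketch, assuming the closing tags are written exactly as lowercase `</p>`. `SplitExtractor` and `FirstParagraphsBySplit` are hypothetical names. Looking for `<p` is deliberately crude and would also pick up tags such as `<pre>`.

```csharp
using System;
using System.Text;

public static class SplitExtractor
{
    // Splits on the closing tag, then trims anything that precedes the
    // opening <p in each piece (such as the tracking <div> in the first one).
    public static string FirstParagraphsBySplit(string html, int count)
    {
        string[] pieces = html.Split(new string[] { "</p>" }, StringSplitOptions.None);
        StringBuilder result = new StringBuilder();
        int taken = 0;

        // The final piece is whatever follows the last </p>, so it is skipped.
        for (int i = 0; i < pieces.Length - 1 && taken < count; i++)
        {
            int open = pieces[i].IndexOf("<p", StringComparison.OrdinalIgnoreCase);
            if (open < 0)
            {
                continue; // a stray </p> with no opening tag before it
            }
            // Split removed the separator, so put the closing tag back.
            result.Append(pieces[i].Substring(open)).Append("</p>");
            taken++;
        }
        return result.ToString();
    }
}
```

Unlike cutting a slice straight out of the input, this rebuilds the output, so any whitespace between the paragraphs is lost.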

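Answer 3: (score: 0)

You could write a method that loads the HTML into a WebBrowser variable, then use the DOM to walk the nodes and extract what you want with your own logic. Take a look at this tutorial.

Here is a snippet showing how to create the WebBrowser in code-behind, rather than the way the tutorial tells you to do it: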

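using System.Windows.Forms;

// Create the control in code instead of dropping it onto a form.
WebBrowser _Browser = new WebBrowser();
string _Source = "Your HTML goes here";

// Navigate to a blank page first to initialize Document, then write the HTML into it.
_Browser.Navigate("about:blank");
_Browser.Document.OpenNew(true);
_Browser.Document.Write(_Source);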