Html Agility Pack: tags remain in the extracted text

Time: 2016-04-11 16:54:24

Tags: c# html html-agility-pack

I have tried two approaches to get the text out of an HTML page with HTML Agility Pack:

Method 1

var root = doc.DocumentNode;

foreach (HtmlNode node in root.SelectNodes("//text()"))
{
    sb.AppendLine(node.InnerText.Trim() + " ");
}

Method 2

var root = doc.DocumentNode;
foreach (var node in root.DescendantsAndSelf())
{
    if (!node.HasChildNodes)
    {
        string text = node.InnerText;
        if (!string.IsNullOrEmpty(text))
            sb.AppendLine(text.Trim() + " ");
    }
}

Both approaches leave the </form> tag behind if the page contains one. For example, this is the result for www.google.com:

"body": " Search Images Maps Play YouTube News Gmail Drive More Calendar
Translate Mobile Books Wallet Shopping Blogger Finance Photos Videos Docs 
Even more &raquo; Account Options Sign in Search settings Web History 
&times; Try a fast, secure browser with updates built in. Yes, get Chrome 
now &nbsp; Advanced search Language tools </form> Advertising Programs 
Business Solutions +Google About Google &copy; 2016 - Privacy - Terms "

What gives?

Edit: by "just the text" I mean the human-language text, so:

<i>book reports</i> becomes book reports

<a href="...">More Details</a> becomes More Details

<div>Check out our <b>deals</b>!</div> becomes Check out our deals!

1 answer:

Answer 0 (score: 0)

Please search for your question before posting.

Using C# regular expressions to remove HTML tags

An example taken from that page:

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
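To make the one-liner above concrete, here is a minimal sketch that wraps it in a helper and also decodes entities. The `TagStripper` class and `StripTags` method names are hypothetical, and using `WebUtility.HtmlDecode` for the entities is an assumption added here, not part of the quoted line. The pattern only removes spans that look like `<...>`, so it is not a full HTML parser: it can misfire on a `>` inside an attribute value, and it keeps the contents of script and style elements.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Removes every span that looks like <...>, then decodes HTML entities
    // such as &copy; or &raquo; into their characters.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]*>", String.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}

// Usage, with the inputs from the question's edit:
//   TagStripper.StripTags("<i>book reports</i>")
//   TagStripper.StripTags("<div>Check out our <b>deals</b>!</div>")
```

Because a stray </form> also matches the pattern, running the same replacement over already extracted text would remove that leftover as well.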

Or, if you want to use Agility Pack (also taken from that page):

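// Requires: using System.Linq; using System.Text; using System.Web; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
// SelectNodes returns null when nothing matches, so guard before enumerating.
var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
StringBuilder output = new StringBuilder();
if (textNodes != null)
{
    foreach (string line in textNodes.Select(node => node.InnerText))
    {
        output.AppendLine(line);
    }
}
// Decode entities such as &nbsp; and &copy; into their characters.
string textOnly = HttpUtility.HtmlDecode(output.ToString());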
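Neither snippet explains why </form> survives in the question's output. One likely cause, stated here as an assumption rather than something confirmed in this thread: Html Agility Pack registers `form` in `HtmlNode.ElementsFlags` with special parsing flags, and under those flags the closing tag can end up as a stray text node, which `//text()` then returns. Below is a minimal sketch of the commonly suggested workaround, which removes that entry before the document is loaded. The `FormTagWorkaround` class and `ExtractText` method names are hypothetical, and the exact behavior may differ between library versions.

```csharp
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class FormTagWorkaround
{
    public static string ExtractText(string html)
    {
        // Assumption: dropping the special-case entry makes <form> parse
        // like an ordinary container element. This changes a static,
        // process-wide setting, so it affects every later parse.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the document has no text nodes.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null) return string.Empty;

        var sb = new StringBuilder();
        foreach (HtmlNode node in textNodes)
        {
            string text = node.InnerText.Trim();
            if (text.Length > 0) sb.Append(text).Append(' ');
        }

        // Decode entities such as &nbsp; and &copy; into their characters.
        return WebUtility.HtmlDecode(sb.ToString()).Trim();
    }
}
```

Note that `//text()` still returns the contents of script and style elements, so those would need to be filtered separately if only visible text is wanted.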