I have tried two methods of getting just the text out of an HTML page with HTML Agility Pack:
Method 1
var root = doc.DocumentNode;
foreach (HtmlNode node in root.SelectNodes("//text()"))
{
sb.AppendLine(node.InnerText.Trim() + " ");
}
Method 2
var root = doc.DocumentNode;
foreach (var node in root.DescendantsAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim() + " ");
}
}
Both methods leave the </form> tag in the output if the page contains one. For example, this is www.google.com:
"body": " Search Images Maps Play YouTube News Gmail Drive More Calendar
Translate Mobile Books Wallet Shopping Blogger Finance Photos Videos Docs
Even more » Account Options Sign in Search settings Web History
× Try a fast, secure browser with updates built in. Yes, get Chrome
now Advanced search Language tools </form> Advertising Programs
Business Solutions +Google About Google © 2016 - Privacy - Terms "
What gives?
Edit: by "just text" I mean the human-readable text, so:
<i>book reports</i>
becomes book reports
<a href="...">More Details</a>
becomes More Details
<div>Check out our <b>deals</b>!</div>
becomes Check out our deals!
Answer 0 (score: 0)
Please search for your question before posting:
Using C# regular expressions to remove HTML tags
An example taken from that page:
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
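As a rough check, the regular expression above can be run against the three examples from the question. This is only a sketch: `TagStripper` and `StripTags` are hypothetical names, and the entity-decoding step via `WebUtility.HtmlDecode` is an addition that the one-liner does not include. The pattern removes anything that looks like a tag, so it also keeps the contents of script and style elements and can be confused by a `>` inside an attribute value.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical helper: strips anything shaped like <...>, then decodes
    // entities such as &amp; so they read as plain characters.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<i>book reports</i>"));                     // book reports
        Console.WriteLine(StripTags("<a href=\"...\">More Details</a>"));        // More Details
        Console.WriteLine(StripTags("<div>Check out our <b>deals</b>!</div>"));  // Check out our deals!
    }
}
```

Because a stray </form> is itself matched by the pattern, this approach also drops it, at the cost of not understanding the document structure at all.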
Or, if you want to use HTML Agility Pack (also taken from that page):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
output.AppendLine(line);
}
string textOnly = HttpUtility.HtmlDecode(output.ToString());
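On the leftover </form> itself: a likely cause, depending on the HTML Agility Pack version, is that `form` is registered in the static `HtmlNode.ElementsFlags` dictionary as an empty element, so the closing tag is not paired with its opening tag and ends up as text. The sketch below assumes that behaviour and removes the entry before parsing. `VisibleText` and `Extract` are hypothetical names, and skipping script and style content is an extra step not taken from the answer above.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class VisibleText
{
    public static string Extract(string html)
    {
        // Assumption: this HAP version flags "form" as an empty element.
        // Removing the entry is process-wide (the dictionary is static);
        // Remove simply returns false if the key is not present.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return string.Empty;

        var sb = new StringBuilder();
        foreach (HtmlNode node in textNodes)
        {
            // Skip text that belongs to scripts or stylesheets.
            string parent = node.ParentNode != null ? node.ParentNode.Name : string.Empty;
            if (parent == "script" || parent == "style")
                continue;

            sb.Append(WebUtility.HtmlDecode(node.InnerText));
        }

        // Collapse runs of whitespace so inline markup does not leave gaps.
        return Regex.Replace(sb.ToString(), @"\s+", " ").Trim();
    }
}
```

Appending the text nodes without a separator keeps inline markup such as `<b>deals</b>!` intact, but adjacent block elements with no whitespace between them will run together; inserting a space after block-level parents would be one way to handle that.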