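I need to output all of the plain text from messages that may contain valid and/or invalid HTML, as well as text that only superficially resembles HTML (i.e. non-HTML text inside <...>, for example: < why would someone do this?? >).
Preserving all non-HTML content matters more than removing all HTML, but ideally I'd like to strip as much HTML as possible for readability.
I'm currently using HTML Agility Pack, but I've run into the problem that non-HTML text between < and > gets removed as well.
My function: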
text = HttpUtility.HtmlDecode(text);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(text);
text = doc.DocumentNode.InnerText;
Simple example input*:
this text has <b>weird < things</b> going on >
Actual output (unacceptable, the word "things" is lost):
this text has weird going on >
Desired output:
this text has weird < things going on >
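Is there a way in HTML Agility Pack to strip only legitimate HTML tags, without removing other content that may contain < and/or >? Or do I need to manually build a whitelist of tags to remove, as in this question? That is my fallback, but I'm hoping a more complete solution is built into HTML Agility Pack (or another tool) and I just haven't been able to find it.
*(The real input usually contains a lot of unwanted HTML; I can give a longer example if that would help.)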
Answer 0 (score: 1)
You can use this pattern to replace HTML tags:
</?[a-zA-Z][a-zA-Z0-9 \"=_-]*?>
Explanation:
<
maybe / (as it may be closing tag)
match a-z or A-Z as the first letter
MAYBE match any of a-z, or A-Z, 0-9, "=_- indefinitely
>
Final code:
using System;
using System.Text.RegularExpressions;

namespace Regular
{
    class Program
    {
        static void Main(string[] args)
        {
            string yourText = "this text has <b>weird < things</b> going on >";
            // Remove anything shaped like an opening or closing tag; "< things" survives
            // because the character after '<' is a space, not a letter.
            string newText = Regex.Replace(yourText, "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>", "");
            Console.WriteLine(newText);
        }
    }
}
Output:
this text has weird < things going on >
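To get a feel for where this pattern stops working, it may help to run it against a few more inputs. The sketch below is not part of the original answer: the class name `TagPatternDemo`, the extra sample strings, and the broader `Broad` pattern are all assumptions added for illustration. The character class in the original pattern has no `:`, `/` or `.`, so an opening tag with a URL attribute, or a self-closing tag such as `<br/>`, is left in place; the broader variant accepts anything except `<` and `>` after the first letter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // Pattern from the answer: after the tag name's first letter, only letters,
    // digits, space, double quote, '=', '_' and '-' are allowed before '>'.
    public const string Narrow = "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>";

    // Hypothetical broader variant (an assumption, not from the answer):
    // after the first letter, accept any character except '<' and '>'.
    public const string Broad = "</?[a-zA-Z][^<>]*>";

    public static string Strip(string text, string pattern)
    {
        return Regex.Replace(text, pattern, "");
    }

    public static void Main()
    {
        string[] samples =
        {
            "this text has <b>weird < things</b> going on >",
            "see <a href=\"http://example.com/\">this page</a><br/>",
            "if a <b and c> d"
        };

        foreach (string sample in samples)
        {
            Console.WriteLine("input : " + sample);
            Console.WriteLine("narrow: " + Strip(sample, Narrow));
            Console.WriteLine("broad : " + Strip(sample, Broad));
        }
    }
}
```

Both patterns keep `< things` in the first sample. In the second sample only the broader pattern removes the `<a href=...>` and `<br/>` tags. The third sample shows the cost of either approach: `<b and c>` is plain text that happens to start with a letter, and both patterns delete it, so neither fully satisfies the requirement that non-HTML content is never lost.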
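@corey-ogburn's comment is not correct, because <[space]abc> would be replaced.
Since you only want to remove the tags from the string, I don't see why you would need to check whether each tag has a matching start/end, but you could easily do that with a regex as well.
Parsing HTML with regex is not always a good choice, but I think it is fine if you only need to process simple text.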
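The question mentions a whitelist of tag names as a fallback. A minimal sketch of how that could be combined with a regex follows; it is an illustration under stated assumptions, not something from the answer. The class name `KnownTagStripper`, the regex, and the deliberately short tag list are all hypothetical, and a real list would need many more names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KnownTagStripper
{
    // Illustrative and deliberately incomplete list of names treated as real HTML tags.
    private static readonly HashSet<string> KnownTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "b", "i", "u", "p", "br", "div", "span", "em", "strong",
            "ul", "ol", "li", "img", "table", "tr", "td"
        };

    // '<', optional '/', a captured tag name, then optionally a whitespace or '/'
    // separator followed by anything except '<' and '>', then '>'.
    private static readonly Regex TagLike =
        new Regex(@"</?([a-zA-Z][a-zA-Z0-9]*)(?:[\s/][^<>]*)?>");

    public static string Strip(string text)
    {
        // Drop the match only if its name is in the list; otherwise keep it verbatim.
        return TagLike.Replace(
            text,
            m => KnownTags.Contains(m.Groups[1].Value) ? "" : m.Value);
    }

    public static void Main()
    {
        Console.WriteLine(Strip("this text has <b>weird < things</b> going on >"));
        Console.WriteLine(Strip("keep <things> but drop <br/> and <a href=\"http://example.com/\">link</a>"));
    }
}
```

With this approach `<things>` is kept because `things` is not a known tag name, while `<br/>` and the `<a ...>` pair are removed. Text such as `<b and c>` would still be removed, since it starts with a known name, so the trade-off is narrowed rather than eliminated.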
Answer 1 (score: 0)
You need:
using System;
using System.Collections.Generic;
Code:
/// <summary>
/// Instances of this class strip HTML/XML tags from a string
/// </summary>
public class HTMLStripper
{
public HTMLStripper() { }
public HTMLStripper(string source)
{
m_source = source;
stripTags();
}
private const char m_beginToken = '<';
private const char m_endToken = '>';
private const char m_whiteSpace = ' ';
private enum tokenType
{
nonToken = 0,
beginToken = 1,
endToken = 2,
escapeToken = 3,
whiteSpace = 4
}
private string m_source = string.Empty;
private string m_stripped = string.Empty;
private string m_tagName = string.Empty;
private string m_tag = string.Empty;
private Int32 m_startpos = -1;
private Int32 m_endpos = -1;
private Int32 m_currentpos = -1;
private IList<string> m_skipTags = new List<string>();
private bool m_tagFound = false;
private bool m_tagsStripped = false;
/// <summary>
/// Gets or sets the source string.
/// </summary>
/// <value>
/// The source string.
/// </value>
public string source { get { return m_source; } set { clear(); m_source = value; stripTags(); } }
/// <summary>
/// Gets the string stripped of HTML tags.
/// </summary>
/// <value>
/// The string.
/// </value>
public string stripped { get { return m_stripped; } set { } }
/// <summary>
/// Gets or sets a value indicating whether [HTML tags were stripped].
/// </summary>
/// <value>
/// <c>true</c> if [HTML tags were stripped]; otherwise, <c>false</c>.
/// </value>
public bool tagsStripped { get { return m_tagsStripped; } set { } }
/// <summary>
/// Adds the name of an HTML tag to skip stripping (leave in the text).
/// </summary>
/// <param name="value">The value.</param>
public void addSkipTag(string value)
{
if (value.Length > 0)
{
// Trim start and end tokens from skipTags if present and add to list
CharEnumerator tmpScanner = value.GetEnumerator();
string tmpString = string.Empty;
while (tmpScanner.MoveNext())
{
if (tmpScanner.Current != m_beginToken && tmpScanner.Current != m_endToken) { tmpString += tmpScanner.Current; }
}
if (tmpString.Length > 0) { m_skipTags.Add(tmpString); }
}
}
/// <summary>
/// Clears this instance.
/// </summary>
public void clear()
{
m_source = string.Empty;
m_tag = string.Empty;
m_startpos = -1;
m_endpos = -1;
m_currentpos = -1;
m_tagsStripped = false;
}
/// <summary>
/// Clears all.
/// </summary>
public void clearAll()
{
this.clear();
m_skipTags.Clear();
}
/// <summary>
/// Strips the HTML tags.
/// </summary>
private void stripTags()
{
// Preserve source and make a copy for stripping
m_stripped = m_source;
// Find first tag
getNext();
// If there are any tags (if next tag is string.Empty we are at EOS)...
if (m_tagName != string.Empty)
{
do
{
// If the tag we found is not to be skipped...
if (!m_skipTags.Contains(m_tagName))
{
// Remove tag from string
m_stripped = m_stripped.Remove(m_startpos, m_endpos - m_startpos + 1);
m_tagsStripped = true;
}
// Get next tag, rinse and repeat (if next tag is string.Empty we are at EOS)
getNext();
} while (m_tagName != string.Empty);
}
}
/// <summary>
/// Steps the pointer to the next HTML tag.
/// </summary>
private void getNext()
{
m_tagFound = false;
m_tag = string.Empty;
m_tagName = string.Empty;
bool beginTokenFound = false;
CharEnumerator scanner = m_stripped.GetEnumerator();
// If we're not at the beginning of the string, move the enumerator to the appropriate location in the string
if (m_currentpos != -1)
{
Int32 index = 0;
do
{
scanner.MoveNext();
index += 1;
} while (index < m_currentpos + 1);
}
while (!m_tagFound && m_currentpos + 1 < m_stripped.Length)
{
// Find next begin token
while (scanner.MoveNext())
{
m_currentpos += 1;
if (evaluateChar(scanner.Current) == tokenType.beginToken)
{
m_startpos = m_currentpos;
beginTokenFound = true;
break;
}
}
// If a begin token is found, find next end token
if (beginTokenFound)
{
while (scanner.MoveNext())
{
m_currentpos += 1;
// If we find another begin token before finding an end token, the earlier one was not a tag.
// Restart the candidate tag at this begin token so that a real tag starting here is not skipped.
if (evaluateChar(scanner.Current) == tokenType.beginToken)
{
m_startpos = m_currentpos;
continue;
}
// If the char immediately following a begin token is a white space we are not in a tag
if (m_currentpos - m_startpos == 1 && evaluateChar(scanner.Current) == tokenType.whiteSpace)
{
m_tagFound = false;
beginTokenFound = true;
break;
}
// End token found
if (evaluateChar(scanner.Current) == tokenType.endToken)
{
m_endpos = m_currentpos;
m_tagFound = true;
break;
}
}
}
if (m_tagFound)
{
// Found a tag, get the info for this tag
m_tag = m_stripped.Substring(m_startpos, (m_endpos + 1) - m_startpos);
m_tagName = m_stripped.Substring(m_startpos + 1, m_endpos - m_startpos - 1);
// If this tag is to be skipped, we do not want to reset the position within the string
// Also, if we are at the end of the string (EOS) we do not want to reset the position
if (!m_skipTags.Contains(m_tagName) && m_currentpos != stripped.Length)
{
m_currentpos = -1;
}
}
}
}
/// <summary>
/// Evaluates the next character.
/// </summary>
/// <param name="value">The value.</param>
/// <returns>tokenType</returns>
private tokenType evaluateChar(char value)
{
tokenType returnValue = new tokenType();
switch (value)
{
case m_beginToken:
returnValue = tokenType.beginToken;
break;
case m_endToken:
returnValue = tokenType.endToken;
break;
case m_whiteSpace:
returnValue = tokenType.whiteSpace;
break;
default:
returnValue = tokenType.nonToken;
break;
}
return returnValue;
}
}
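The class above appears to rest on one rule: a `<` starts a tag only if the next character is not a space and a `>` is reached before another `<`. As a rough illustration of that rule alone, here is a compact single-pass sketch. It is not the answer's code: the name `SimpleTagScanner` is hypothetical, it has no skip-tag list, it treats any whitespace (not only a space) after `<` as disqualifying, and it leaves an empty `<>` in place.

```csharp
using System;
using System.Text;

public static class SimpleTagScanner
{
    public static string Strip(string text)
    {
        var result = new StringBuilder(text.Length);
        int i = 0;

        while (i < text.Length)
        {
            // A candidate tag starts with '<' followed by a non-whitespace character.
            if (text[i] == '<' && i + 1 < text.Length && !char.IsWhiteSpace(text[i + 1]))
            {
                // Scan forward to the next '<' or '>', whichever comes first.
                int j = i + 1;
                while (j < text.Length && text[j] != '<' && text[j] != '>')
                {
                    j++;
                }

                // Only a non-empty "<...>" closed before another '<' counts as a tag.
                if (j < text.Length && text[j] == '>' && j > i + 1)
                {
                    i = j + 1; // skip the whole tag
                    continue;
                }
            }

            // Not part of a tag: copy the character through unchanged.
            result.Append(text[i]);
            i++;
        }

        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Strip("this text has <b>weird < things</b> going on >"));
    }
}
```

Because the output is built in a single `StringBuilder` pass, the scan never has to restart from the beginning of the string after each removal.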