Question

Get a substring of text that contains HTML tags

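Assume that you want the first 10 characters of the following:

"<p>This is paragraph 1</p><p>This is paragraph 2</p>"

The output would be:

"<p>This is"

The returned text contains an unclosed P tag. If this is rendered to a page, subsequent content will be affected by the open P tag. Ideally, the preferred output would close any unclosed HTML tags, in the reverse order of how they were opened:

"<p>This is</p>"

I want a function that returns a substring of HTML and ensures that no tag is left unclosed.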

Answer 1

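You need to teach your code that the string is actually HTML or XML. Treating it as a plain string will not let you work with it the way you want. That means first converting it into a proper structured format and then working with that format.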

Using an XSL stylesheet

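If your HTML is well-formed, load it into an XMLDocument and run it through an XSL stylesheet that does something like the following (XPath string positions start at 1, so the first 10 characters are substring(text(), 1, 10)):

<xsl:template match="p">
  <xsl:value-of select="substring(text(), 1, 10)" />
</xsl:template>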

Using an HTML parser

If it is not well-formed XML (as in your example, which stops abruptly in the middle), you need to use a HTML parser of some kind, such as the HTML Agility Pack (see the question about C# HTML parsers).

Do not use regular expressions, because HTML is too complex to parse using regex.
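As an illustration of the parser-based approach, here is a minimal sketch in Python using the standard library's html.parser module rather than the HTML Agility Pack mentioned above. The names TruncatingParser, truncate_html, and VOID_TAGS are made up for this example, the void-element list is deliberately partial, and the limit counts visible text characters only (tag characters are not counted, unlike the "first 10 characters" in the question).

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TruncatingParser(HTMLParser):
    """Copies markup through until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__()      # convert_charrefs=True: entities arrive decoded
        self.remaining = limit  # text characters still allowed
        self.out = []           # pieces of the output document
        self.open_tags = []     # stack of elements still waiting for a close

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return              # past the limit: drop the tag entirely
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it, nothing to close later.
        if self.remaining > 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.remaining > 0 and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        piece = data[: self.remaining]
        # Re-escape, because the parser handed us decoded text.
        self.out.append(html.escape(piece, quote=False))
        self.remaining -= len(piece)


def truncate_html(html_text, limit):
    parser = TruncatingParser(limit)
    parser.feed(html_text)
    parser.close()
    # Close whatever is still open, innermost element first.
    closers = ["</%s>" % t for t in reversed(parser.open_tags)]
    return "".join(parser.out + closers)


print(truncate_html("<p>this is paragraph 1</p><p>this is paragraph 2</p>", 7))
# <p>this is</p>
```

Because the parser tracks which elements are open, the closing tags come out in reverse order of opening without any string matching on the markup itself.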

Answer 2

You can use the following static function. For a working example, see: http://www.koodr.com/item/438c2e9c-62a8-45fc-9ca2-db1479f412e1. You could also turn it into an extension method.

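// Requires: using System.Text; using System.Text.RegularExpressions;
public static string HtmlSubstring(string html, int maxlength)
{
    // Patterns for a single start/end tag and for an element pair with no content
    string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
    string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";

    // Match every HTML start and end tag; otherwise take one character at a time
    var expression = new Regex(string.Format("({0})|(.?)", htmltag));
    MatchCollection matches = expression.Matches(html);

    int i = 0; // number of text characters copied so far
    StringBuilder content = new StringBuilder();
    foreach (Match match in matches)
    {
        if (match.Value.Length == 1 && i < maxlength)
        {
            // Plain text character, still under the limit
            content.Append(match.Value);
            i++;
        }
        else if (match.Value.Length > 1)
        {
            // The match is a tag: always keep it so the markup stays balanced
            content.Append(match.Value);
        }
    }

    // Remove element pairs left empty by the truncation; repeat so that
    // nested empties such as <div><span></span></div> are removed as well
    string result = content.ToString();
    string previous;
    do
    {
        previous = result;
        result = Regex.Replace(result, emptytags, string.Empty);
    } while (result != previous);

    return result;
}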

Answer 3

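Your requirements are very unclear, so most of this is guesswork. You also have not provided any code that would help clarify what you are trying to do.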

One solution could be:

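a. Find the text between the <p> and </p> tags. You can use the following regex, or a simple string search:

<p>(.*?)</p>

b. In the text you found, apply Substring() to extract the required text.

c. Put the extracted text back between the <p> and </p> tags.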
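Steps a to c can be sketched in a few lines. The version below is in Python rather than C#, with slicing standing in for Substring(); the function name truncate_paragraphs is made up for this example. It assumes the paragraphs contain plain text only: any nested tag inside a <p> would be cut like ordinary text, which is exactly the weakness the other answers warn about.

```python
import re

# Step a: the pattern from the answer, with DOTALL so paragraphs may span lines.
P_PATTERN = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def truncate_paragraphs(html_text, limit):
    """Trim the text inside each <p>...</p> pair to at most `limit` characters."""

    def shorten(match):
        inner = match.group(1)[:limit]   # step b: take the substring
        return "<p>" + inner + "</p>"    # step c: put it back between the tags

    return P_PATTERN.sub(shorten, html_text)


print(truncate_paragraphs("<p>this is paragraph 1</p><p>this is paragraph 2</p>", 7))
# <p>this is</p><p>this is</p>
```

Note that the limit applies to each paragraph separately, so this trims every paragraph rather than cutting the document at one point.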

Answer 4

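You can walk through the HTML string looking for angle brackets and build an array of tags, recording whether each tag has a matching closing tag. The catch is that HTML allows tags that are never closed, such as img, br, and meta, so you need to know about those. You also need rules that check the order of closing, because simply pairing each open tag with a close tag does not guarantee valid HTML: if you open a div, then a p, then close the div and then close the p, the result is not valid.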
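A rough sketch of that bookkeeping in Python follows. A small regex stands in for the angle-bracket scan, and the names unclosed_tags, TAG_RE, and VOID_TAGS are invented for this example; the void-tag set is only a partial list. The function returns the stack of tags that are still open and raises ValueError when tags are closed in the wrong order.

```python
import re

# Tags that never need a closing tag (partial list, for illustration only).
VOID_TAGS = {"img", "br", "meta", "hr", "input", "link"}

# Captures: optional leading "/", the tag name, optional trailing "/".
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")


def unclosed_tags(html_text):
    """Return the still-open tags, outermost first; raise on bad nesting."""
    stack = []
    for closing, name, self_closing in TAG_RE.findall(html_text):
        name = name.lower()
        if name in VOID_TAGS or self_closing:
            continue                  # nothing to match up later
        if not closing:
            stack.append(name)        # opening tag: remember it
        elif not stack or stack.pop() != name:
            # A close that does not match the most recent open tag
            raise ValueError("unexpected </%s>" % name)
    return stack


open_tags = unclosed_tags("<div><p>some <b>text")
print("".join("</%s>" % t for t in reversed(open_tags)))
# </b></p></div>
```

With the div/p example from the answer, unclosed_tags("<div><p></div></p>") raises ValueError, because </div> arrives while p is still the innermost open tag.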

Answer 5

Try this code (Python 3.x):

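# Void elements that never take a closing tag (extend as needed)
notags = ('img', 'br', 'hr')

def substring2(html, size):
    """Return html cut after `size` text characters, closing any open tags."""
    if len(html) <= size:
        return html
    result, tag, count = '', '', 0
    intag = False  # must be initialised: the input may start with plain text
    tags = []      # stack of currently open tag names
    for c in html:
        result += c
        if c == '<':
            intag = True
        elif c == '>' and intag:
            intag = False
            tag = tag.split()[0]
            if tag[0] == '/':
                # closing tag: pop its opener from the stack
                tag = tag.replace('/', '')
                if tag not in notags and tags:
                    tags.pop()
            else:
                # opening tag: remember it unless it is self-closing or void
                if tag[-1] != '/' and tag not in notags:
                    tags.append(tag)
            tag = ''
        else:
            if intag:
                tag += c  # still reading the tag name and attributes
            else:
                count += 1  # only visible text counts toward the limit
                if count >= size:
                    break
    # close whatever is still open, innermost first
    while len(tags) > 0:
        result += '</{0}>'.format(tags.pop())
    return result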

s='<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>'
print(s)
for size in (30,40,55):
    print(substring2(s,size))

Output

<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>
<div class="main">html <code>substring</code> function writte</div>
<div class="main">html <code>substring</code> function written by <span>imxyl</span></div>
<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a></div>

More

See the code on GitHub.

Another question.
