Question

In case the title of the question isn't clear:

I want to turn this:

<p><a rel="nofollow" data-xxx="797998" href="http://www.stackoverflow.com">StackOverflow</a> for the win</p>

into this:

http://www.stackoverflow.com StackOverflow for the win

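I found plenty of useful questions about stripping HTML tags with an HTML parser or even a regular expression, but none of them mention keeping an HTML attribute.

How can I achieve this?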

Answer 1

This should be possible with two regular expression replacements.

string html = "<p><a rel=\"nofollow\" data-xxx=\"797998\" href=\"http://www.stackoverflow.com\">StackOverflow</a> for the win</p>";

string parsed = Regex.Replace(html, "<[^>]+href=\"([^\"]+)\"[^>]*>", "$1 ");
parsed = Regex.Replace(parsed, "<[^>]+>", "");

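The first replacement extracts the value of the href attribute and removes the tag that contains it. The second pass removes all remaining tags, including closing tags.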
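For comparison, the same two-pass idea can be sketched in JavaScript, the language used in Answer 2. The function name `stripTagsKeepHref` is made up for this illustration, and the sketch assumes double-quoted attribute values, as in the question's sample; it is not a general HTML parser.

```javascript
// Hypothetical helper; mirrors the two Regex.Replace calls above.
function stripTagsKeepHref(html) {
  // Pass 1: replace every tag that carries href="..." with the URL plus a space.
  const withUrls = html.replace(/<[^>]+href="([^"]+)"[^>]*>/g, '$1 ');
  // Pass 2: remove every remaining tag, including closing tags.
  return withUrls.replace(/<[^>]+>/g, '');
}

const html = '<p><a rel="nofollow" data-xxx="797998" href="http://www.stackoverflow.com">StackOverflow</a> for the win</p>';
console.log(stripTagsKeepHref(html));
// prints: http://www.stackoverflow.com StackOverflow for the win
```

The order of the passes matters: if the generic tag removal ran first, the href value would be gone before it could be captured.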

Answer 2

This is rather specific to your input, but it does what you asked.

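var str = '<p><a rel="nofollow" data-xxx="797998" href="http://www.stackoverflow.com">StackOverflow</a> for the win</p>';

// Remove the closing tags.
str = str.replace('</a>', '');
str = str.replace('</p>', '');
// Replace the end of the opening <a ...> tag with a space,
// so the URL and the link text stay separated.
str = str.replace('">', ' ');

// Everything after href=" is now the URL followed by the text.
var p = str.indexOf('href="');

console.log(str.slice(p + 'href="'.length));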

Answer 3

An example using plain string operations:

private string ParseAttribute(string input, string attributeName)
{
    int startIndex = input.IndexOf(attributeName + "=\"");

    if (startIndex >= 0)
    {
        startIndex += attributeName.Length + 2;
        int endIndex = input.IndexOf('"', startIndex);

        if (endIndex >= 0)
            return input.Substring(startIndex, endIndex - startIndex);
    }

    return string.Empty;
}

// usage
string html = "<p><a rel=\"nofollow\" data-xxx=\"797998\" href=\"http://www.stackoverflow.com\">StackOverflow</a> for the win</p>";

Console.WriteLine(ParseAttribute(html, "href"));

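This code may have its drawbacks, but it does what you asked.

Edit: OK, I see I missed that you also want the element's content. I'll leave the snippet here anyway; maybe it is useful in some way.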
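One way to close that gap is to pair the attribute lookup with a separate step that recovers the text. This is a minimal JavaScript sketch, not part of the original answer: `parseAttribute` is a port of the method above, `textContent` is a hypothetical helper, and the sketch assumes a single link with a double-quoted href.

```javascript
// Port of the ParseAttribute idea: value of the first name="..." found, or ''.
function parseAttribute(input, attributeName) {
  const marker = attributeName + '="';
  const start = input.indexOf(marker);
  if (start < 0) return '';
  const valueStart = start + marker.length;
  const end = input.indexOf('"', valueStart);
  return end >= 0 ? input.slice(valueStart, end) : '';
}

// Assumption: the element content is whatever remains once all tags are dropped.
function textContent(input) {
  return input.replace(/<[^>]+>/g, '');
}

const sample = '<p><a rel="nofollow" data-xxx="797998" href="http://www.stackoverflow.com">StackOverflow</a> for the win</p>';
const combined = parseAttribute(sample, 'href') + ' ' + textContent(sample);
console.log(combined);
// prints: http://www.stackoverflow.com StackOverflow for the win
```

Because the URL is simply placed in front of the whole text, this only gives a sensible result when the link is the first thing in the input.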

Answer 4

It can be as simple as this:

string yourinput = "...";
// Note: this removes every tag, so the href value is discarded along with the <a> tag.
string result = Regex.Replace(yourinput, "<.*?>", string.Empty);

Remove HTML tags from a string but keep the href attribute
