Question

I have a simple requirement: extract pieces of text from HTML. Suppose the HTML is

<h1>hello</h1> ... <img moduleType="calendar" /> ...<h2>bye</h2>

I want to split it into three parts:

<h1>hello</h1>

<img moduleType="calendar" />

<h2>bye</h2>

The goal is to extract two categories of text: plain HTML, and special tags of the form <img moduleType="calendar" />.

Answer 1

Don't do that; HTML can break a regex in many wonderful ways. Use Beautiful Soup instead.
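To illustrate the parser-based route, here is a minimal sketch. It swaps in Python's standard-library html.parser instead of Beautiful Soup so that it has no dependencies; the class name CalendarSplitter and the function split_calendar are made up for this example. It assumes the markup is reasonably well formed, and it drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class CalendarSplitter(HTMLParser):
    """Split markup into plain-HTML chunks and calendar <img> tags."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []    # list of (kind, text) tuples
        self._buffer = []  # raw markup seen since the last calendar tag

    def _flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.parts.append(("html", text))
        self._buffer = []

    def _handle_tag(self, tag, attrs):
        raw = self.get_starttag_text()
        # html.parser lowercases attribute names: moduleType -> moduletype.
        module_type = dict(attrs).get("moduletype") or ""
        if tag == "img" and module_type.lower() == "calendar":
            self._flush()
            self.parts.append(("calendar", raw))
        else:
            self._buffer.append(raw)

    def handle_starttag(self, tag, attrs):
        self._handle_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._handle_tag(tag, attrs)

    def handle_endtag(self, tag):
        self._buffer.append(f"</{tag}>")

    def handle_data(self, data):
        self._buffer.append(data)

    def handle_entityref(self, name):
        self._buffer.append(f"&{name};")

    def handle_charref(self, name):
        self._buffer.append(f"&#{name};")

    def close(self):
        super().close()
        self._flush()  # emit whatever follows the last calendar tag


def split_calendar(markup):
    parser = CalendarSplitter()
    parser.feed(markup)
    parser.close()
    return parser.parts
```

Calling split_calendar on the HTML from the question should give three parts: the markup before the calendar tag, the tag itself, and the markup after it.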

Answer 2

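It depends on the language and the context you are working in. I did something similar in a CMS; my approach was to find the tags first and then the attributes.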

Get the tags:

"<img (.*?)/>"

Then I search each result for a specific attribute:

'title="(.*?)"'

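If you want to find all attributes, you can easily replace the explicit title with a pattern such as [a-z]+ or a run of non-whitespace characters, and then loop over the results.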
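The two passes described above can be sketched in Python with the standard re module. This is only an illustration under some assumptions: the names IMG_TAG, ATTRIBUTE and img_attributes are invented here, attribute values are assumed to be double-quoted, and a "/>" inside an attribute value would cut the tag short.

```python
import re

# First pass: capture everything between "<img " and the closing "/>".
IMG_TAG = re.compile(r"<img\s+(.*?)/>", re.IGNORECASE | re.DOTALL)
# Second pass: capture each name="value" pair inside one tag.
# The name is any run of characters that are neither whitespace nor "=".
ATTRIBUTE = re.compile(r'([^\s=]+)\s*=\s*"(.*?)"')


def img_attributes(markup):
    """Return one dict of attributes per self-closing <img> tag."""
    return [dict(ATTRIBUTE.findall(body)) for body in IMG_TAG.findall(markup)]
```

Each dict can then be checked for the attribute of interest, for example by keeping only those whose moduleType is "calendar".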

Answer 3

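I actually tried to do something similar, because the ASP.NET compiler compiles markup into a server control tree and makes heavy use of regular expressions along the way. I have a stopgap solution; it isn't very good, but it seems to work.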

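using System;
using System.Text.RegularExpressions;

string source = "<h1>hello<img moduleType=\"calendar\" /></h1> <p> <img moduleType=\"calendar\" /> </p> <h2>bye</h2> <img moduleType=\"calendar\" /> <p>sss</p>";
// Group 1 is the markup before an <img ... /> tag; group 2 is the tag itself.
// (.*?) instead of (.+?) so that a tag at the very start of the input, or
// directly after another tag, is still captured as a tag and not as text.
Regex exImg = new Regex("(.*?)(<img.*?/>)", RegexOptions.Singleline);

var match = exImg.Match(source);
int lastEnd = 0;
while (match.Success)
{
    Console.WriteLine(match.Groups[1].Value);
    Console.WriteLine(match.Groups[2].Value);
    lastEnd = match.Index + match.Length;
    match = match.NextMatch();
}
// Whatever follows the last <img> tag is the final plain-HTML part.
Console.WriteLine(source.Substring(lastEnd));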
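The same splitting idea can be written more compactly by splitting on the tag pattern. The sketch below uses Python's re.split, which keeps the delimiter in the result when the pattern has a capturing group; split_on_img is a name made up for this example, and like the loop above it treats every self-closing <img> tag as a separator and does not check moduleType.

```python
import re

# The capturing group makes re.split keep each matched <img ... /> tag
# in the result list instead of discarding it.
IMG_TAG = re.compile(r"(<img\b[^>]*?/>)", re.IGNORECASE)


def split_on_img(markup):
    """Return the markup as alternating plain-HTML chunks and <img> tags."""
    pieces = IMG_TAG.split(markup)
    # Drop chunks that are empty or whitespace only, and trim the rest.
    return [piece.strip() for piece in pieces if piece.strip()]
```

Note that the plain-HTML chunks are raw slices of the input, so a chunk such as "<h1>hello" may contain unbalanced tags when an image sits inside another element.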

Extracting parts of HTML with regular expressions
