Question

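I'm trying to batch-process (search and replace) a couple hundred thousand HTML pages using regex in Notepad++. All the pages have exactly the same layout, and I'm basically trying to copy one element (a headline) into the page's title tag, which is currently not empty.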

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>

I can find:

The title tag: <title>(.*?)</title>
And the span containing the REAL title: 
(\s*<div id="uniqueID">\s*)<span>(.*)</span>(\s*</div>)

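But I can't seem to put them into a single expression (ignoring the junk in between) so that I can search and replace with it in Notepad++.

The uniqueID div is the same in every page (whitespace, newlines), and it contains nothing except the span with its content. The title tag obviously occurs only once per page. I'm just getting started with regex, and the possibilities are endless. I know regex isn't ideal for parsing HTML, but for this case it should do. Does anyone know how to patch these two expressions together so that the content in between is ignored?

Thank you very much!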

Answer 1

You can use the following in Notepad++'s Replace dialog to copy the headline from the span into the title tag...

Find what: <title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)
Replace with: <title>$3</title>$2

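...if you select Regular expression and check ". matches newlin" in the dialog (yes, "newlin" rather than "newline", at least in the version of Notepad++ on the machine I'm using). By using $2 and $3, you're making use of backreferences to the groups' captured values.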
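
For readers who would rather script the same replacement, here is a minimal Python sketch of it. It is only an illustration and not part of the Notepad++ workflow: the function name copy_headline_to_title and the sample page are made up, re.DOTALL is assumed to stand in for the ". matches newline" checkbox, and \g<3> and \g<2> are Python's spelling of $3 and $2.

```python
import re

# Same pattern as the "Find what" field; re.DOTALL lets "." cross newlines.
PATTERN = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)


def copy_headline_to_title(html: str) -> str:
    """Replace the <title> text with the text of the span inside the uniqueID div."""
    # Group 3 is the span text; group 2 is everything from after </title>
    # up to and including the closing </div> of the uniqueID block.
    return PATTERN.sub(r"<title>\g<3></title>\g<2>", html, count=1)


# Hypothetical sample page, shortened from the one in the question.
page = """<html>
<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
...other stuff...</body>
"""

result = copy_headline_to_title(page)
print(result)
```

Everything outside the matched region is left untouched, and count=1 limits the substitution to a single match, which fits the one title tag per page described in the question.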

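A less constrained pattern for the span contents (such as .* with ". matches newline" checked) could run past the intended span and grab a span later in the file, for example: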

<html>
<head>
<title>some title</title>
<lots of junk and newlines>
</head>
<body>
<lots of stuff, tags, content><span>stuff</span><div>more stuff</div>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>
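
The effect on a page like this one can be checked outside Notepad++ with a short Python sketch. The variable names are hypothetical, the page is the sample trimmed a little, and re.DOTALL is assumed to behave like ". matches newline".

```python
import re

page = """<head>
<title>some title</title>
</head>
<body>
<div id="uniqueID">
<span>The Title that should be copied into head's title tag</span>
</div>
<div>
<span>The text that should not be copied into the head's title tag but will be</span>
</div>
...other stuff...</body>
"""

# Loose version: (.*) inside the span may run across tags and newlines.
loose = re.compile(
    r'<title>(.*)</title>(.*<div id="uniqueID">\s*<span>(.*)</span>\s*</div>)',
    re.DOTALL,
)
# Constrained version: the span may hold only letters, spaces and apostrophes.
strict = re.compile(
    r"""<title>(.*)</title>(.*<div id="uniqueID">\s*<span>([A-Za-z ']*)</span>\s*</div>)""",
    re.DOTALL,
)

loose_title = loose.search(page).group(3)
strict_title = strict.search(page).group(3)

print(repr(loose_title))   # spills over into the second span
print(repr(strict_title))  # stops at the end of the first span
```

Because .* is greedy, the loose capture stretches to the last </span> that is followed by </div>, so it swallows the markup between the two spans as well as the second span's text.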

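If the headline to be copied from the span contains characters other than upper- and lowercase letters, spaces, and apostrophes (digits, for instance, are not covered as written), add them to the character class [A-Za-z '] as needed (for example, [A-Za-z '_] to include underscores). Just be careful with the HTML markup characters themselves, such as < and >.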
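
As a rough Python sketch of widening the character class: the helper build_pattern and the sample headline are invented for illustration, and the negated class [^<>] is an alternative that is not part of the original answer. It accepts any character except the two markup characters.

```python
import re


def build_pattern(char_class: str) -> re.Pattern:
    """Build the find pattern with the given bracket expression for the span text."""
    return re.compile(
        r'<title>(.*)</title>(.*<div id="uniqueID">\s*<span>('
        + char_class
        + r'*)</span>\s*</div>)',
        re.DOTALL,
    )


# Hypothetical page whose headline contains digits, a colon and an underscore.
page = (
    "<head>\n<title>some title</title>\n</head>\n<body>\n"
    '<div id="uniqueID">\n<span>Top 10 Tips: Part_2</span>\n</div>\n</body>\n'
)

original_match = build_pattern("[A-Za-z ']").search(page)        # class from the answer
extended_match = build_pattern("[A-Za-z0-9 '_:]").search(page)   # extra characters added
no_markup_match = build_pattern("[^<>]").search(page)            # anything except < and >

print(original_match)
print(extended_match.group(3))
print(no_markup_match.group(3))
```

The original class finds no match here because the headline contains characters outside it, while both wider classes capture the whole headline without being able to cross a tag boundary.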
