Question

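I need to match all newline characters that fall outside a particular HTML tag or pseudo-tag.

Here is an example. In this text fragment, I want to match every "\n" outside the [code] [/code] tags (so that I can replace them with <br> tags):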

These concepts are represented by simple Python classes.  
Edit the polls/models.py file so it looks like this: 

[code]  
from django.db import models

class Question(models.Model):
    question_text = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published') 
[/code]

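I know I should use a negative lookahead, but I am struggling to put the whole thing together.

Specifically, I need a PCRE expression; I will be using it from PHP and Python.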

Answer 1

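To me, this situation seems to come straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc. Please visit that link for a full discussion of the solution.

I will give you answers for both PHP and Python (since the example mentions django).

PHP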

(?s)\[code\].*?\[/code\](*SKIP)(*F)|\n

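The left side of the alternation matches a complete [code]...[/code] block, then deliberately fails and skips the part of the string that was just matched. The right side matches newlines, and we know they are the right newlines because they were not matched by the expression on the left.

This PHP program shows how to use the regex (see the results at the bottom of the online demo):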

<?php
$regex = '~(?s)\[code\].*?\[/code\](*SKIP)(*F)|\n~';

$subject = "These concepts are represented by simple Python classes.
Edit the polls/models.py file so it looks like this:

[code]
from django.db import models

class Question(models.Model):
question_text = models.CharField(max_length=200)
pub_date = models.DateTimeField('date published')
[/code]";

$replaced = preg_replace($regex,"<br />",$subject);
echo $replaced."<br />\n";
?>

Python

For Python, here is our simple regex:

(?s)\[code\].*?\[/code\]|(\n)

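The left side of the alternation matches a complete [code]...[/code] block. We will ignore these matches. The right side matches newlines and captures them in Group 1, and we know they are the right newlines because they were not matched by the expression on the left.

This Python program shows how to use the regex (see the results at the bottom of the online demo):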

import re
subject = """These concepts are represented by simple Python classes.  
Edit the polls/models.py file so it looks like this: 

[code]  
from django.db import models

class Question(models.Model):
    question_text = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published') 
[/code]"""

regex = re.compile(r'(?s)\[code\].*?\[/code\]|(\n)')
def myreplacement(m):
    if m.group(1):
        return "<br />"
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
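
Since the question mentions "a particular HTML tag or pseudo-tag", the same capture-group trick can be extended to several protected block types. The following is only a minimal sketch under assumptions: the function name `replace_newlines_outside` and its `tags` and `replacement` parameters are hypothetical, and it only covers square-bracket pseudo-tags. Because it relies on lazy matching, it does not handle nested blocks of the same type.

```python
import re

def replace_newlines_outside(text, tags=("code",), replacement="<br />"):
    """Replace newlines that are not inside [tag]...[/tag] blocks.

    The name and parameters are hypothetical, chosen for illustration.
    """
    # One alternative per protected block type, e.g. \[code\].*?\[/code\]
    protected = "|".join(
        r"\[{0}\].*?\[/{0}\]".format(re.escape(tag)) for tag in tags
    )
    regex = re.compile(r"(?s)(?:{0})|(\n)".format(protected))
    # Group 1 is set only for newlines outside every protected block;
    # protected blocks are matched too, but put back unchanged.
    return regex.sub(
        lambda m: replacement if m.group(1) else m.group(0), text
    )
```

For example, calling `replace_newlines_outside(subject)` on the subject above should behave like the program in this answer, while `tags=("code", "quote")` would also leave a hypothetical [quote] block alone.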

Answer 2

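Here is another solution implemented in Python, but without regular expressions. It also handles nested code blocks (if needed) and reads the file line by line, which is useful when processing very large files.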

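# Nesting depth of [code] blocks; 0 means we are outside any block.
in_code = 0

with open('file.txt', 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if line.startswith('[code]'):
            if in_code == 0:
                # Put the outermost opening tag on its own line.
                line = '\n' + line
            in_code += 1
            output_file.write(line)
        elif line.startswith('[/code]'):
            in_code -= 1
            output_file.write(line)
        elif in_code == 0:
            # Outside any code block: swap the newline for <br />.
            output_file.write(line.rstrip('\n') + '<br />')
        else:
            # Inside a code block: copy the line unchanged.
            output_file.write(line)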
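
To check the nesting behaviour without touching the filesystem, the same depth-counting loop can be wrapped in a generator that accepts any iterable of lines. This is a sketch only; `convert_lines` is a hypothetical name that does not appear in the answer, and it assumes, like the loop above, that the tags sit at the start of a line.

```python
def convert_lines(lines):
    """Yield lines with '<br />' replacing newlines outside [code] blocks."""
    depth = 0  # current nesting depth of [code] blocks
    for line in lines:
        if line.startswith('[code]'):
            if depth == 0:
                # Keep the outermost opening tag on its own line.
                line = '\n' + line
            depth += 1
            yield line
        elif line.startswith('[/code]'):
            depth -= 1
            yield line
        elif depth == 0:
            # Outside any block: swap the trailing newline for <br />.
            yield line.rstrip('\n') + '<br />'
        else:
            # Inside a (possibly nested) block: pass through unchanged.
            yield line


sample = ["intro\n", "[code]\n", "a\n", "[code]\n", "b\n",
          "[/code]\n", "c\n", "[/code]\n", "outro\n"]
result = "".join(convert_lines(sample))
```

Because the function is a generator, it can also be fed an open file object directly, which keeps the memory use low for large inputs.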

Regex to match all newline characters outside of certain tags
