Question

我想从字符串

中删除php标签

content = re.sub('<\?php(.*)\?>', '', content)

似乎在单行php标签上运行正常但是当一个php标签关闭后的某些行时，它无法捕获它。有人可以帮忙吗？

Answer 1

使用正则表达式无法解决此问题。从字符串中解析PHP需要一个能够理解至少一点PHP的真正解析器。

但是，如果你有PHP可用，你可以很容易地解决这个问题。最后的PHP解决方案。

以下演示了正则表达式方法出错的方法：

import re

testcases = {
    'easy':("""show this<?php echo 'NOT THIS'?>""",'show this'),
    'multiple tags':("""<?php echo 'NOT THIS';?>show this, even though it's conditional<?php echo 'NOT THIS'?>""","show this, even though it's conditional"),
    'omitted ?>':("""show this <?php echo 'NOT THIS';""", 'show this '),
    'nested string':("""show this <?php echo '<?php echo "NOT THIS" ?>'?> show this""",'show this  show this'),
    'shorttags':("""show this <? echo 'NOT THIS SHORTTAG!'?> show this""",'show this  show this'),
    'echotags':("""<?php $TEST = "NOT THIS"?>show this <?=$TEST?> show this""",'show this  show this'),
}

testfailstr = """
FAILED: %s
IN:     %s
EXPECT: %s
GOT:    %s
"""

removephp = re.compile(r'(?s)<\?php.*\?>')

for testname, (in_, expect) in testcases.items():
    got = removephp.sub('',in_)
    if expect!=got:
        print testfailstr % tuple(map(repr, (testname, in_, expect, got)))

请注意，使用正则表达式传递所有测试用例是非常困难的，如果不是不可能的话。

如果您有PHP可用，您可以使用PHP的标记生成器去除PHP。以下代码应该从字符串中删除所有 PHP代码，并且应该涵盖所有奇怪的角落案例。

// one-character token, always code
define('T_ONECHAR_TOKEN', 'T_ONECHAR_TOKEN');

function strip_php($input) {
    $tokens = token_get_all($input);

    $output = '';
    $inphp = False;
    foreach ($tokens as $token) {
        if (is_string($token)) {
            $token = array(T_ONECHAR_TOKEN, $token);
        }
        list($id, $str) = $token;
        if (!$inphp) {
            if ($id===T_OPEN_TAG or $id==T_OPEN_TAG_WITH_ECHO) {
                $inphp = True;
            } else {
                $output .= $str;
            }
        } else {
            if ($id===T_CLOSE_TAG) {
                $inphp = False;
            }
        }
    }

    return $output;
}

$test = 'a <?php //NOT THIS?>show this<?php //NOT THIS';


echo strip_php($test);

Answer 2

如果您只想处理简单的情况，一个简单的正则表达式将正常工作。 Python正则表达式中的*?运算符给出了最小匹配。

import re

_PHP_TAG = re.compile(r'<\?php.*?\?>', re.DOTALL)
def strip_php(content):
    return _PHP_TAG.sub('', content)

INPUT = """
Simple: <?php echo $a ?>.
Two on one line: <?php echo $a ?>, <?php echo $b ?>.
Multiline: <?php 
    if ($a) {
        echo $b;
    }
?>.
"""

print strip_php(INPUT)

输出：

Simple: .
Two on one line:  (keep this) .
Multiline: .

我希望你不是用它来消毒输入，，因为这不足以达到这个目的。（这是黑名单，不是白名单，黑名单永远不够。）

如果您想处理复杂的案例，例如：

<?php echo '?>' ?>

您仍然可以使用正则表达式执行此操作，但您可能希望重新考虑使用哪些工具，因为正则表达式可能会变得太复杂而无法阅读。以下正则表达式将处理Francis Avila的所有测试用例：

dstr = r'"(?:[^"\\]|\\.)*"'
sstr = r"'(?:[^'\\]|\\.)*'"
_PHP_TAG = re.compile(
    r'''<\?[^"']*?(?:(?:%s|%s)[^"']*?)*(?:\?>|$)''' % (dstr, sstr)
)
def strip_php(content):
    return _PHP_TAG.sub('', content)

正则表达式几乎足以解决此问题。我们之所以知道这一点，是因为PHP使用正则表达式来标记PHP源代码。您可以阅读PHP在Zend/zend_language_scanner.l中使用的正则表达式。它是为Lex编写的，这是一种通用正则表达式创建标记器的常用工具。

我说“差不多”的原因是因为我们实际上正在使用扩展的正则表达式。

Answer 3

你可以通过这个来做到这一点：

content = re.sub('\n','', content)
content = re.sub('<\?php(.*)\?>', '', content)

OP的评论后

更新了答案：

content = re.sub('\n',' {NEWLINE} ', content)
content = re.sub('<\?php(.*)\?>', '', content)
content = re.sub(' {NEWLINE} ','\n', content)

ipython中的示例：

In [81]: content
Out[81]: ' 11111  <?php 222\n\n?> \n22222\nasd  <?php asd\nasdasd\n?>\n3333\n'

In [82]: content = re.sub('\n',' {NEWLINE} ', content)
In [83]: content
Out[83]: ' 11111  <?php 222 {NEWLINE}  {NEWLINE} ?>  {NEWLINE} 22222 {NEWLINE} asd  <?php asd {NEWLINE} asdasd {NEWLINE} ?> {NEWLINE} 3333 {NEWLINE} '

In [84]: content = re.sub('<\?php(.*)\?>', '', content)
In [85]: content
Out[85]: ' 11111   {NEWLINE} 3333 {NEWLINE} '

In [88]: content = re.sub(' {NEWLINE} ','\n', content)
In [89]: content
Out[89]: ' 11111  \n3333\n'

使用python从字符串中删除php标签

3 个答案: