Question

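I need to filter strings while building the XML for an electronic invoice in Italy.

Only values matching a specific type are accepted:

String1000LatinType
"[\p{IsBasicLatin}\p{IsLatin-1Supplement}]{1,1000}"

I don't know the exact ranges, but I think they cover:

a-z, A-Z, 0-9, accented characters such as à ò ù è é ì, ç, the symbols , . _ - : ; ' and the space

I want to exclude every other symbol that can be typed directly from the keyboard, such as "£$%&/()=?^°§*+\|/<> and tab

I tried to convert the string with this function, but I'm not an expert with regexp:

function sanitize($tag) {
 $newtag = preg_replace ("/[\p{Latin}A-Z0-9a-z\-\_\.\,\:\;' ]/", "", $tag);
 return $newtag;
}
$tag = "Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\/l'èok .,;:";
var_dump(sanitize($tag));

Can someone help me?

I want to get back

Qwerty 12345  èéòàùì ç.,-_l'èok .,;: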

Answer 1

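It seems PHP does not support \p{IsLatin-1Supplement}. However, you can use Unicode code point ranges in the regex instead. As Wikipedia says of the Latin-1 Supplement block:

The block ranges from U+0080 to U+00FF

\p{IsBasicLatin} matches characters from U+0000 to U+007F. So you need to match, and remove, any char whose code point is outside \x00 to \xFF, plus all punctuation and symbols except the special chars you want to keep:

preg_replace('~(?:[^\x00-\xFF]|(?![.,_;:\'-])[\p{P}\p{S}])+~u', '', $tag)

See the regex demo.

Details

(?: - start of a non-capturing group
- [^\x00-\xFF] - any char other than those in the Unicode code point range \x00 to \xFF
- | - or
- (?![.,_;:\'-])[\p{P}\p{S}] - any punctuation (\p{P}) or symbol (\p{S}) char that is not one of the chars in the list (.,_;:'-)
)+ - end of the group, repeated 1 or more times.

See the PHP demo:

function sanitize($tag) {
 $newtag = preg_replace('~(?:[^\x00-\xFF]|(?![.,_;:\'-])[\p{P}\p{S}])+~u', '', $tag);
 return $newtag;
}
$tag = "Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\/l'èok .,;:";
var_dump(sanitize($tag));
// => Qwerty 12345  èéòàùì ç.,-_l'èok .,;: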
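
The pattern above works as a deny-list: it keeps everything in U+0000 to U+00FF that is not punctuation or a symbol, so control characters such as tab stay in the string. If the goal is to keep only the characters listed in the question, an allow-list may be easier to reason about. The following is a minimal sketch under that assumption; the function name sanitize_allowlist is hypothetical, and the chosen letter ranges (the Latin-1 accented letters without the × and ÷ signs) are my reading of the question, not something the schema prescribes.

```php
<?php
// Hypothetical helper: removes every character that is NOT explicitly allowed.
function sanitize_allowlist($tag) {
    // Allowed: ASCII letters and digits, Latin-1 accented letters
    // (U+00C0-U+00FF except U+00D7 and U+00F7), space, and . , _ : ; ' -
    $pattern = '~[^A-Za-z0-9\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{FF} .,_:;\'-]+~u';
    $clean = preg_replace($pattern, '', $tag);
    // preg_replace() returns null on failure, e.g. when $tag is not valid UTF-8.
    return $clean === null ? '' : $clean;
}

// Same sample string as in the question, with a trailing tab added.
$tag = "Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\\/l'èok .,;:\t";
echo sanitize_allowlist($tag), PHP_EOL;
// Qwerty 12345  èéòàùì ç.,-_l'èok .,;:
```

Because everything outside the list is dropped, tab and other control characters are removed as well, at the cost of having to extend the list whenever another character should be accepted.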

Answer 2

After some testing, I created this function to suit my purpose:

function sanitize_string_xml($string, $opzioni = array()) {

    // Typographic quotes mapped to the plain ASCII quotes allowed in the output
    $chr_map = array(
       // Windows codepage 1252
       "\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
       "\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
       "\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
       "\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
       "\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
       "\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
       "\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
       "\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

       // Regular Unicode     // U+0022 quotation mark (")
                              // U+0027 apostrophe     (')
       "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
       "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
       "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
       "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
       "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
       "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
       "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
       "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
       "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
       "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
       "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
       "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
    );

    $type = isset($opzioni['Type']) ? $opzioni['Type'] : "";    // "IsBasicLatin" or "IsLatin"
    $lunghezzaMax = isset($opzioni['LunghezzaMax']) ? (int) $opzioni['LunghezzaMax'] : 0;    // maximum length, 0 = no limit

    // Decode HTML entities, then convert out-of-range quotes into allowed ones.
    // This runs before the range filters below, which would otherwise strip these characters.
    $string = strtr(html_entity_decode($string, ENT_QUOTES, "UTF-8"), $chr_map);

    // Replace line breaks with a space before the filters remove control characters
    $string = str_replace(array("\r\n", "\r", "\n"), " ", $string);

    if ( $type == "IsBasicLatin" ) {
        // Transliterate accented letters to ASCII, then keep printable ASCII only
        $unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', "ü" => "u", 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
        $string = strtr( $string, $unwanted_array );
        $string = preg_replace('/[^\x{0020}-\x{007E}]+/u', '', $string);
    }

    if ( $type == "IsLatin" ) {
        // Transliterate the few letters outside Latin-1, then keep printable Basic Latin and Latin-1 Supplement
        $unwanted_array = array(  'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z' );
        $string = strtr( $string, $unwanted_array );
        $string = preg_replace('/[^\x{0020}-\x{007E}\x{00A0}-\x{00FF}]+/u', '', $string);
    }

    // Truncate by characters rather than bytes, and before escaping, so that no
    // multi-byte character or entity is cut in half (requires the mbstring extension)
    if ( $lunghezzaMax > 0 ) {
        $string = mb_substr($string, 0, $lunghezzaMax, "UTF-8");
    }

    // Escape the XML special characters
    return htmlspecialchars($string);

}

Usage example:

$clear_string = sanitize_string_xml($dirty_string, array("Type" => "IsLatin", "LunghezzaMax" => 60));
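
To double-check a value before it is written into the invoice, the schema pattern from the question can be approximated in PCRE with an explicit code point range. This is only a sketch: the helper name is_string_latin is hypothetical and not part of either answer, and it assumes that the range U+0000 to U+00FF is an adequate stand-in for the two Unicode blocks. The check is meant for the unescaped value, since the length limit applies to the characters themselves and not to entities such as &amp;.

```php
<?php
// Hypothetical helper, not part of the answers above.
// Approximates [\p{IsBasicLatin}\p{IsLatin-1Supplement}]{1,N} with a code point
// range, since PHP's regex engine does not appear to support these block names.
function is_string_latin($value, $maxLength = 1000) {
    $pattern = '/\A[\x{00}-\x{FF}]{1,' . (int) $maxLength . '}\z/u';
    // preg_match() returns 1 on a match, 0 on no match, and false on error
    // (for example when $value is not valid UTF-8).
    return preg_match($pattern, $value) === 1;
}

var_dump(is_string_latin("Città è"));              // bool(true)
var_dump(is_string_latin("Prezzo 10 €"));          // bool(false): U+20AC is outside the range
var_dump(is_string_latin(""));                     // bool(false): at least one character is required
var_dump(is_string_latin(str_repeat("è", 1001)));  // bool(false): longer than 1000 characters
```

With the u modifier the quantifier counts code points, so an accented letter counts as one character even though it takes two bytes in UTF-8.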
