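While building the XML for Italian electronic invoicing, I need to filter strings.
The schema only accepts characters allowed by one specific type, String1000LatinType:
"[\p{IsBasicLatin}\p{IsLatin-1Supplement}]{1,1000}"
I don't know the exact ranges, but I think they cover:
a-z, A-Z, 0-9,
accented letters such as: à ò ù è é ì and ç,
the symbols: , . _ - : ; '
and the space.
I want to exclude every other symbol that can be typed directly from the keyboard, for example: "£$%&/()=?^°§*+\|/<> and the tab character.
I tried to convert the string with this function, but I'm not a regexp expert: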
function sanitize($tag) {
$newtag = preg_replace ("/[\p{Latin}A-Z0-9a-z\-\_\.\,\:\;' ]/", "", $tag);
return $newtag;
}
$tag = "Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\/l'èok .,;:";
var_dump(sanitize($tag));
Can anyone help me?
I would like to get back:
Qwerty 12345 èéòàùì ç.,-_l'èok .,;:
Answer 0: (score: 0)
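It seems PHP does not support \p{IsLatin-1Supplement}. However, you can use Unicode code point ranges in the regex. As Wikipedia says, the Latin-1 Supplement block ranges from U+0080 to U+00FF, and \p{IsBasicLatin} matches characters from U+0000 to U+007F. So you need to remove any char outside the code point range \x00 to \xFF, plus all punctuation and symbols except the few you want to keep: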
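preg_replace('~(?:[^\x00-\xFF]|(?![.,_:;\'-])[\p{P}\p{S}])+~u', '', $tag)
See the regex demo.
Details
(?: - start of a non-capturing group
[^\x00-\xFF] - any char other than those in the Unicode code point range \x00 to \xFF
| - or
(?![.,_:;\'-])[\p{P}\p{S}] - any punctuation (\p{P}) or symbol (\p{S}) char that is not one of the chars in the .,_:;'- list
)+ - end of the group, repeated 1 or more times.
See the PHP demo: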
function sanitize($tag) {
$newtag = preg_replace('~(?:[^\x00-\xFF]|(?![.,_:;\'-])[\p{P}\p{S}])+~u', '', $tag);
return $newtag;
}
$tag = "Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\/l'èok .,;:";
var_dump(sanitize($tag));
// => Qwerty 12345 èéòàùì ç.,-_l'èok .,;:
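If the set of allowed characters is exactly the one listed in the question, a plain whitelist may be easier to reason about than a blacklist of categories, and it also drops tabs. The following is only a sketch under that assumption: sanitize_whitelist and is_string1000_latin are hypothetical names, the accented letters are limited to the ones the question mentions, and the source file is assumed to be saved as UTF-8. The second helper checks a value against the code point range U+0000 to U+00FF quoted above.

```php
<?php
// Hypothetical helpers, not part of the answer above.

// Keep only an explicit whitelist: ASCII letters and digits, the space,
// the punctuation . , _ - : ; ' and the accented letters from the question.
// Everything else, including tabs and line breaks, is removed.
function sanitize_whitelist(string $tag): string
{
    // The trailing "-" inside the class is a literal hyphen.
    $clean = preg_replace("~[^A-Za-z0-9 .,_:;'àòùèéìç-]+~u", '', $tag);
    // preg_replace() returns null on failure (for example invalid UTF-8).
    return $clean ?? '';
}

// Check a value against [\p{IsBasicLatin}\p{IsLatin-1Supplement}]{1,1000},
// expressed as the code point range U+0000 to U+00FF.
function is_string1000_latin(string $value): bool
{
    return preg_match('/\A[\x{0000}-\x{00FF}]{1,1000}\z/u', $value) === 1;
}

$tag = 'Qwerty 12345 £$%&/()=?^ èéòàùì +*°ç.,-_<>\/l\'èok .,;:';
$clean = sanitize_whitelist($tag);
var_dump($clean);
var_dump(is_string1000_latin($clean));
```

The whitelist is stricter than the schema type, which accepts the whole U+0000 to U+00FF range, so the two helpers answer different questions: what you want to keep versus what the pattern would accept.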
Answer 1: (score: 0)
After some testing, I created this function, which suits my purpose:
function sanitize_string_xml($string, $opzioni = array()) {
$chr_map = array(
// Windows codepage 1252
"\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
"\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
"\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
"\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
"\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
"\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
"\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
"\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark
// Regular Unicode // U+0022 quotation mark (")
// U+0027 apostrophe (')
"\xC2\xAB" => '"', // U+00AB left-pointing double angle quotation mark
"\xC2\xBB" => '"', // U+00BB right-pointing double angle quotation mark
"\xE2\x80\x98" => "'", // U+2018 left single quotation mark
"\xE2\x80\x99" => "'", // U+2019 right single quotation mark
"\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
"\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
"\xE2\x80\x9C" => '"', // U+201C left double quotation mark
"\xE2\x80\x9D" => '"', // U+201D right double quotation mark
"\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
"\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
"\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
"\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$type = isset($opzioni['Type']) ? $opzioni['Type'] : ""; // IsBasicLatin /IsLatin
$lunghezzaMax = isset($opzioni['LunghezzaMax']) ? $opzioni['LunghezzaMax'] : "";
if ( $type == "IsBasicLatin" ) {
$unwanted_array = array( 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', "ü" => "u", 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$string = strtr( $string, $unwanted_array );
$string = preg_replace('/[^\x{0020}-\x{007E}]+/u', '', $string);
}
if ( $type == "IsLatin" ) {
$unwanted_array = array( 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z' );
$string = strtr( $string, $unwanted_array );
$string = preg_replace('/[^\x{0020}-\x{007E}\x{00A0}-\x{00FF}]+/u', '', $string);
}
// Convert out-of-range quotation marks into the allowed quotes:
$chr = array_keys ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$string = str_replace($chr, $rpl, html_entity_decode($string, ENT_QUOTES, "UTF-8"));
$string = str_replace(PHP_EOL, " ", $string);
if ( $lunghezzaMax != "" ) {
// Truncate by characters (not bytes) and before escaping, so that neither
// a UTF-8 sequence nor an XML entity is cut in half (requires mbstring)
$string = mb_substr($string, 0, (int) $lunghezzaMax, "UTF-8");
}
$string = htmlspecialchars($string);
return $string;
}
Usage example:
$clear_string = sanitize_string_xml($dirty_string, array("Type" => "IsLatin", "LunghezzaMax" => 60));
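One thing to watch in a function like this is the order of the steps: if the range filter runs before the quote map, typographic quotes such as U+2019 are already gone by the time the map is applied. The sketch below is not the function above but a shortened variant, under the assumption that the intended order is decode, map quotes, filter, truncate, escape. The name sanitize_latin_sketch, the $maxLength parameter and the reduced quote map are made up for illustration, and mb_substr assumes the mbstring extension is available.

```php
<?php
// Hypothetical sketch, not the answer's function: similar building blocks,
// applied in the order decode -> map quotes -> filter -> truncate -> escape.
function sanitize_latin_sketch(string $string, int $maxLength = 0): string
{
    // Assumed subset of the quote map, kept short for brevity.
    $quotes = [
        "\u{2018}" => "'", "\u{2019}" => "'",
        "\u{201C}" => '"', "\u{201D}" => '"',
    ];

    // 1. Decode entities first, so the filter sees real characters.
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    // 2. Map typographic quotes while they are still present.
    $string = strtr($string, $quotes);
    // 3. Turn line breaks and tabs into a space, then drop everything outside
    //    printable Basic Latin and Latin-1 Supplement.
    $string = preg_replace('/[\r\n\t]+/', ' ', $string) ?? '';
    $string = preg_replace('/[^\x{0020}-\x{007E}\x{00A0}-\x{00FF}]+/u', '', $string) ?? '';
    // 4. Truncate by characters, before escaping.
    if ($maxLength > 0) {
        $string = mb_substr($string, 0, $maxLength, 'UTF-8');
    }
    // 5. Escape for XML text content.
    return htmlspecialchars($string, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

var_dump(sanitize_latin_sketch("Tom &amp; Jerry €", 60));
```

Truncating before escaping means the length limit counts the characters of the text itself, not the extra characters introduced by entities such as &amp;.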