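How can I use a regular expression to separate incorrectly joined words in text?

Asked: 2017-10-24 01:04:40

Tags: php regex preg-replace

If words in a text have been run together incorrectly, how can I separate them? For example, I have text like this: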

HelloEveryOne, СаломБаХама, Ҳама дарПеши ҷаҳонЯк мебошадАммо.
HELLOeveryOneHelloFORyouYOU HELLO everyOneHello FORyouYOU
canBEcorrectedThisSTRINGinCorrectlyFORm
canBEcorrected ThisSTRINGin CorrectlyFORm 

and I would like to get this result:

Hello Every One, Салом Ба Хама, Ҳама дар Пеши ҷаҳон Як мебошад Аммо.
HELLO every One Hello FOR you YOU HELLO every One Hello FOR you YOU
can BE corrected This STRING in Correctly FOR m
can BE corrected This STRING in Correctly FOR m

Thank you!

3 answers:

Answer 0: (score: 2)

You can use Unicode metacharacters to find uppercase and lowercase letters. Something like:

\B(\p{Lu}[\p{Ll}.,!]+)

and replace with

 \1

Regex demo: https://regex101.com/r/QskwDd/2/

In PHP it can be used as:

$string = 'HelloEveryOne, СаломБаХама, Ҳама дарПеши ҷаҳонЯк мебошадАммо.';
echo preg_replace('/\B(\p{Lu}[\p{Ll}.,!]+)/u', ' \1', $string);

Demo: https://3v4l.org/ZjHh4

A simpler approach might be to look only for uppercase letters and add a space before each one.

\B\p{Lu}

Replace with:

 \0

Regex demo: https://regex101.com/r/QskwDd/1/
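
For comparison, here is a minimal PHP sketch of the simpler pattern. The variable names and sample strings are only illustrative. Because the pattern matches every uppercase letter that is not at a word boundary, it also breaks up all-caps runs, which the second call shows.

```php
<?php
// Insert a space before every uppercase letter that is not at the start of a word.
// $0 (equivalent to \0) refers to the whole match, here a single uppercase letter.
$pattern = '/\B\p{Lu}/u';

$titleCase = preg_replace($pattern, ' $0', 'HelloEveryOne');
echo $titleCase, PHP_EOL; // Hello Every One

// Limitation: every letter inside an all-caps run is split off as well.
$allCaps = preg_replace($pattern, ' $0', 'HELLOevery');
echo $allCaps, PHP_EOL; // H E L L Oevery
```

If all-caps words have to stay intact, the first pattern in this answer or the pattern in the next answer is likely a better fit.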

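Answer 1: (score: 1)

This was a tricky challenge to crack! ...but I got it. Negative lookarounds proved ineffective at eliminating the unwanted substrings. The (*SKIP)(*FAIL) technique did the job.

The logic behind this is to target three types of words, regardless of spacing. They are:

  • all lowercase
  • title case (one uppercase letter followed by lowercase letters)
  • all uppercase

See the inline comments in the PHP code block for a layman's explanation of the pattern.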

Pattern: (Demo)

/(?:\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}{2,}+)[,.!?]?(?:\s|$)(*SKIP)(*FAIL)|(?:\p{Ll}+|\p{Lu}{2,}+|\p{Lu}\p{Ll}+)[,.!?]?/u

Code: (Demo)

$input='HelloEveryOne, СаломБаХама, Ҳама дарПеши ҷаҳонЯк мебошадАммо.
HELLOeveryOneHelloFORyouYOU HELLO everyOneHello FORyouYOU
can,BEcorrectedThisSTRINGinCorrectlyFORm
canBEcorrected ThisSTRINGin CorrectlyFORm.';

//                                optional trailing punctuation-vvvv     vvvv- white space or end of input (that we don't want to replace)
var_export(preg_replace('/(?:\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}{2,}+)[,.!?]?(?:\s|$)(*SKIP)(*FAIL)|(?:\p{Ll}+|\p{Lu}{2,}+|\p{Lu}\p{Ll}+)[,.!?]?/u','$0 ',$input));
//                 all lower-^^^^^^^               ^^^^^^^^^^^-all upper                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-repeat first alternative without trailing white space or end of input
//          one upper then all lower-^^^^^^^^^^^^^                            ^^^^^^^^^^^^^^-discard these matches

Output:

'Hello Every One, Салом Ба Хама, Ҳама дар Пеши ҷаҳон Як мебошад Аммо.
HELLO every One Hello FOR you YOU HELLO every One Hello FOR you YOU
can, BE corrected This STRING in Correctly FOR m
can BE corrected This STRING in Correctly FOR m.'
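
To see which substrings the pattern actually replaces, the matches can be listed with preg_match_all. This is only an illustrative sketch on a shortened ASCII input; the variable names $sample, $needSpace and $fixed are not part of the answer.

```php
<?php
// Same pattern as in the answer: words already followed by whitespace or the end
// of the input are consumed by the first branch and discarded via (*SKIP)(*FAIL).
$pattern = '/(?:\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}{2,}+)[,.!?]?(?:\s|$)(*SKIP)(*FAIL)|(?:\p{Ll}+|\p{Lu}{2,}+|\p{Lu}\p{Ll}+)[,.!?]?/u';
$sample = 'canBEcorrected ThisSTRINGin';

// Collect the words that still need a trailing space.
preg_match_all($pattern, $sample, $matches);
$needSpace = $matches[0];
echo implode('|', $needSpace), PHP_EOL; // can|BE|This|STRING

// Append a space to each of those matches.
$fixed = preg_replace($pattern, '$0 ', $sample);
echo $fixed, PHP_EOL; // can BE corrected This STRING in
```

"corrected" and "in" do not appear in the list because they are already followed by whitespace or by the end of the string, so the first branch skips them.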

Answer 2: (score: -2)

I don't know this locale, so I can't test those unusual characters, but the first string can be handled with this:

<?php

$str = 'HelloEveryOne';
$newStr = '';

for ($i = 0; $i < strlen($str); $i++) {
    // Add a space before each uppercase letter, except at the start of the string.
    $newStr .= ($i > 0 && ctype_upper($str[$i])) ? ' ' : '';
    $newStr .= $str[$i];
}

echo $newStr;

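The ctype_upper function returns true if every character in the string is uppercase. I pass it one character at a time, so if that character is uppercase, the program adds a space before it.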
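
Since the loop reads the string byte by byte, it is unlikely to handle multibyte letters such as the Cyrillic ones in the question. A hedged sketch of a Unicode-aware variant follows. The function name spaceBeforeUppercase is hypothetical, and the sketch assumes UTF-8 input; it uses preg_split with the u modifier to iterate over characters rather than bytes.

```php
<?php
// Hypothetical helper: adds a space before an uppercase letter that directly
// follows another letter, working on UTF-8 characters instead of bytes.
function spaceBeforeUppercase(string $str): string
{
    // Split into characters; the u modifier makes the split UTF-8 aware.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $result = '';
    $prev = '';
    foreach ($chars as $ch) {
        $isUpper = preg_match('/^\p{Lu}$/u', $ch) === 1;
        $prevIsLetter = $prev !== '' && preg_match('/^\p{L}$/u', $prev) === 1;
        if ($isUpper && $prevIsLetter) {
            $result .= ' ';
        }
        $result .= $ch;
        $prev = $ch;
    }
    return $result;
}

echo spaceBeforeUppercase('HelloEveryOne'), PHP_EOL; // Hello Every One
echo spaceBeforeUppercase('дарПеши'), PHP_EOL;       // дар Пеши
```

Like the loop it is based on, this splits all-caps words into single letters, so it only suits input made of title-case and lowercase words.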