Question

Consider the following string:

hello, my name is 冰岛, nice to meet you

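I need to scan this string and classify each character as one of the following types:

1) Western text (letters and digits only)
2) Chinese text (ideographs only, no punctuation)
3) Anything else (any other character, whether Western, Chinese, or otherwise)

Can anyone point me in the right direction? Thanks.

Edit: I think this was downvoted for being too generic, so here is what I have so far.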

for ($i = 0, $l = mb_strlen($string); $i < $l; $i++)
  {
   // Take one (possibly multibyte) character at a time.
   $char = mb_substr($string, $i, 1);

   if (preg_match("/^[a-zA-Z]$/", $char)) $type = "alpha";
   else
   ...
   ;
  }

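Regular expressions for anything beyond detecting alphabetic characters are beyond my knowledge, especially one that matches only Han ideographs and leaves out all Chinese punctuation and special symbols.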

Answer 1

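I suggest using preg_replace_callback to grab the chunks of text you need: a regex captures the different classes of text into different named groups, and the callback builds the resulting array from those captures: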

$s = "hello, my name is 冰岛, nice to meet you";
$res = array();
// The callback is used only to fill $res; the replaced string is discarded.
preg_replace_callback('~\b(?<Chinese>\p{Han}+)\b|\b(?<Western>[a-zA-Z0-9]+)\b|(?<Other>[^\p{Han}A-Za-z0-9\s]+)~su',
 function($m) use (&$res) {
    // Compare against "" instead of using empty(), which would also skip the token "0".
    if (isset($m["Chinese"]) && $m["Chinese"] !== "") {
        $t = array("type" => "Han", "value" => $m["Chinese"]);
        array_push($res, $t);
    }
    else if (isset($m["Western"]) && $m["Western"] !== "") {
        $t = array("type" => "Western", "value" => $m["Western"]);
        array_push($res, $t);
    }
    else if (isset($m["Other"]) && $m["Other"] !== "") {
        $t = array("type" => "Other", "value" => $m["Other"]);
        array_push($res, $t);
    }
 },
$s);
print_r($res);

See the online PHP demo.

Pattern details:

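\b(?<Chinese>\p{Han}+)\b - a whole Chinese word (one or more Han characters between word boundaries)
| - or
\b(?<Western>[a-zA-Z0-9]+)\b - a whole word consisting only of ASCII letters and digits
| - or
(?<Other>[^\p{Han}A-Za-z0-9\s]+) - one or more characters other than Han characters, ASCII letters, ASCII digits and whitespace (\s).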
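The question asks for a per-character classification, and the same character classes can be applied one character at a time. The following is only a minimal sketch under that assumption: classifyChar is a hypothetical helper name that does not appear in the answer above, and here whitespace falls into "Other" because the question's third category is "anything else".

```php
<?php
// Hypothetical helper (not part of the original answer): classify one character.
function classifyChar(string $char): string
{
    // ASCII letters and digits only.
    if (preg_match('/^[A-Za-z0-9]$/', $char)) {
        return 'Western';
    }
    // Han script; CJK punctuation such as "，" has script Common and is not matched.
    if (preg_match('/^\p{Han}$/u', $char)) {
        return 'Han';
    }
    return 'Other';
}

$s = "hello, 冰岛!";
$types = array();
// Split into Unicode characters; the u modifier keeps multibyte characters intact.
foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $char) {
    $types[] = classifyChar($char);
}
print_r($types);
```

Consecutive characters of the same type could then be merged into runs if chunks rather than single characters are wanted.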

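The s modifier is redundant here because the pattern contains no dot, but if you later add a . and want it to match line breaks, s makes it match those characters too.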

The u modifier is required here because you are working with a Unicode string.
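To illustrate why the u modifier matters, here is a small sketch that counts what . matches with and without it. It assumes the source file and the string are UTF-8 encoded, in which case each of the two Han characters occupies three bytes.

```php
<?php
$s = "冰岛";
// Without u, PCRE treats the subject as single bytes.
$bytes = preg_match_all('/./', $s);
// With u, the subject is decoded as UTF-8, so . matches whole characters.
$chars = preg_match_all('/./u', $s);
echo "$bytes $chars\n";
```

Under that UTF-8 assumption the first count is the byte length and the second is the character count, which is why \p{Han} only behaves as expected with u.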

Also, see the Unicode Properties section at regular-expressions.info for more information on Unicode properties (for example, you may be interested in the \p{P} and \p{S} properties).
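As one possible use of \p{P} and \p{S}, the "Other" class could be split further into punctuation and symbols. This is only a sketch: classifyOther is a hypothetical helper name, and the sample characters are chosen purely for illustration.

```php
<?php
// Hypothetical helper: refine the "Other" class using general-category properties.
function classifyOther(string $char): string
{
    // \p{P} covers punctuation, including CJK punctuation such as "，" and "。".
    if (preg_match('/^\p{P}$/u', $char)) {
        return 'Punctuation';
    }
    // \p{S} covers symbols such as math and currency signs.
    if (preg_match('/^\p{S}$/u', $char)) {
        return 'Symbol';
    }
    return 'Other';
}

foreach (array('，', '。', '+', '$', ' ') as $char) {
    echo classifyOther($char), "\n";
}
```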

PHP: Classifying characters in a string