I want to split text into individual words using PHP. Do you know how to achieve this?
My approach:
function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));
Is this a good approach? Do you have any ideas for improvement?
Thanks in advance!
Answer 0 (score: 29)
Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class.
$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);
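This splits on a run of one or more whitespace characters and also absorbs any punctuation adjacent to that whitespace. It additionally matches punctuation at the very beginning or end of the string. As a result, it handles cases such as "don't" and "he said 'ouch!'" correctly: the apostrophe inside "don't" is kept, while the quotes and exclamation mark around "ouch" are stripped.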
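To make the behavior described above concrete, here is a minimal sketch that wraps the pattern in a function. The function name split_words and the sample sentence are hypothetical, and the u modifier is an added assumption (it requires the input to be valid UTF-8) so that \p{P} also applies to non-ASCII punctuation.

```php
<?php
// Hypothetical helper: the pattern is the one from this answer,
// with the /u modifier added (assumes valid UTF-8 input).
function split_words(string $text): array
{
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example, invalid UTF-8 with /u).
    return $parts === false ? [] : $parts;
}

$words = split_words("He said 'ouch!' and I don't know why.");
print_r($words);
// Tokens: He, said, ouch, and, I, don't, know, why
```

Punctuation is only removed when it touches whitespace or the string boundaries, which is why the apostrophe in "don't" survives.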
Answer 1 (score: 12)
Tokenize - strtok。
<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
// Double quotes are required so that \n and \t are interpreted as newline and tab.
$delim = " \n\t,.!?:;";
$tok = strtok($text, $delim);
while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>
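Answer 2 (score: 3)
I would lowercase the string first, before splitting it. That makes the i modifier and the array processing afterwards unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.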
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
Edit: use Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specifically than \W.
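A minimal sketch of the [\p{P}\p{Z}] variant follows. The sample text is hypothetical and is already lowercase, and the u modifier is an assumption added here so that the pattern and the subject are treated as UTF-8. Note that \p{Z} covers separators such as spaces but not tabs or newlines, so \s may need to be added to the class for multi-line input.

```php
<?php
// Hypothetical sample text, already lowercase and encoded as UTF-8.
$text = 'größe zählt, snake_case too!';

// Split on runs of Unicode punctuation (\p{P}) and separators (\p{Z}).
$words = preg_split('/[\p{P}\p{Z}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
// Tokens: größe, zählt, snake, case, too
```

The underscore is connector punctuation, so unlike \W+ this class also splits "snake_case" into two tokens.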
Answer 3 (score: 1)
Use:
str_word_count($text, 1);
Or, if you need Unicode support:
function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();
    // Cast to string so that a null $search is not passed to preg_quote().
    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }
    // Format 0 returns the word count; any other format returns the words.
    if ($format == 0)
    {
        return count($result);
    }
    return $result;
}
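As a rough illustration of what the character class in the helper above matches, here is a standalone sketch that uses the same pattern without the optional $search characters. The French sample text is hypothetical, and the file is assumed to be saved as UTF-8.

```php
<?php
// Same character class as the helper: letters, non-spacing marks,
// dash punctuation, the ASCII apostrophe and the right single quote (U+2019).
$pattern = '~[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+~u';

// Hypothetical sample text containing both apostrophe styles and a hyphen.
$text = "L’été, c'est l'après-midi";

preg_match_all($pattern, $text, $matches);
$words = $matches[0];

print_r($words);
// Tokens: L’été, c'est, l'après-midi
```

Apostrophes and hyphens are treated as part of a word here, so contractions and hyphenated compounds stay in one piece.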
Answer 4 (score: 1)
You can also use the PHP strtok() function to fetch string tokens from a larger string. You can use it like this:
$result = array();
// your original string
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
// Pass strtok() your string and a delimiter that specifies how tokens are separated. Here, words are separated by a space.
$word = strtok($text, ' ');
while ($word !== false) {
    $result[] = $word;
    // Subsequent calls take only the delimiter and continue with the same string.
    $word = strtok(' ');
}
Read more about strtok() in the PHP documentation.
Answer 5 (score: 1)
You can also use the explode function: http://php.net/manual/en/function.explode.php
$words = explode(" ", $sentence);
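Keep in mind that explode() splits only on the exact delimiter, so punctuation stays attached to the words and consecutive spaces produce empty strings. The following is a minimal sketch of one possible cleanup; the sample sentence and the list of punctuation characters passed to trim() are assumptions chosen for illustration.

```php
<?php
// Hypothetical sample with a double space and trailing punctuation.
$sentence = 'Exclamation marks,  too!';

// Plain explode(): punctuation is kept and the double space yields an empty string.
$words = explode(' ', $sentence);
// $words is ['Exclamation', 'marks,', '', 'too!']

// Strip a fixed set of punctuation characters from both ends of each piece.
$trimmed = array_map(function ($word) {
    return trim($word, '.,!?:;');
}, $words);

// Drop pieces that are now empty and reindex the array.
$cleaned = array_values(array_filter($trimmed, function ($word) {
    return $word !== '';
}));

print_r($cleaned);
// Tokens: Exclamation, marks, too
```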