I'm writing code that breaks text into words and then does things like calculating word sizes.
I came up with this (after some searching):
$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$words = mb_split( ' +', $text );
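However, contractions don't work, because apostrophes and single quotes look the same (because they are the same character).
I need a way to separate words that keeps contractions intact. For now I've added every contraction I could think of as a stop word, but that is far from satisfactory. I'm not very comfortable with regular expressions and would appreciate some advice.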
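Answer 0 (score: 1)
I found a better way: using word boundaries together with the characters allowed inside a word, you can count the words directly: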
<?php
$text = "One morning, when Gregor Samsa woke from troubled dreams,
he found himself transformed in his bed into a horrible vermin.
'He lay on his armour-like back', and if he lifted his head a
little he could see his brown belly, slightly domed and divided by arches
into stiff sections. The bedding was hardly able to cover it and
seemed ready to slide off any moment. His many legs, pitifully thin
compared with the size of the rest of him, waved about helplessly as he
looked. \"What's happened to me?\" he thought. It wasn't a dream. His
room, a proper human room although a little too small, lay peacefully
between its four familiar walls. A collection of textile samples lay
spread out on the table - Samsa was a travelling salesman - and
above it there hung a picture that he had recently cut out of an
illustrated magazine and housed in a nice, gilded frame. It showed
a lady fitted out with a fur hat and fur boa who sat upright,
raising a heavy fur muff that covered the whole of her lower arm
towards the viewer. Gregor then turned to look out the window at the
dull weather";
preg_match_all("/\b[\w'-]+\b/", $text, $words);
print_r(count($words[0]));
Note: I allowed both ' and - inside a word, so something like "armour-like" is counted as a single word.
Regex test: regexr.com/4ego6
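To see what the pattern keeps and what it drops, here is a minimal sketch on a shorter sample. The sample sentence and the variable names are illustrative assumptions, not part of the original answer. It also computes each word's length, since the question mentions word sizes.

```php
<?php
// Hypothetical sample text; the pattern is the one from the answer above.
$sample = "\"What's happened?\" It wasn't a dream, he said 'armour-like back'.";

preg_match_all("/\b[\w'-]+\b/", $sample, $matches);
$words = $matches[0];

// strlen() counts bytes, which equals the character count for ASCII-only text.
$lengths = array_map('strlen', $words);

print_r($words);
print_r($lengths);
```

Because `\b` needs a word character on one side, a quote that opens or closes a phrase is left out, while the apostrophe inside "wasn't" stays part of the word.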
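Answer 1 (score: 0)
I've been working on this for a while. The comments and Taha Paksu's very effective solution helped me think the problem through. Taha Paksu's solution isolates words cleanly, except when they contain accented letters. A Google search seems to suggest that regex is not very friendly to non-ASCII characters.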
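The accent problem can be reproduced with a short sketch (the sample string is an assumption for illustration). Without the `u` modifier, PCRE works on bytes and, in the default C locale, `\w` matches only ASCII letters, digits and the underscore, so the UTF-8 bytes of "é" end the word. With `u`, PHP treats pattern and subject as UTF-8, and on the PHP builds I am aware of `\w` then also matches accented letters.

```php
<?php
// Hypothetical sample; the pattern is the one from Taha Paksu's answer.
$sample = "Café and pokémon";

// Byte mode: the two bytes of "é" are not word characters, so words break there.
preg_match_all("/\b[\w'-]+\b/", $sample, $m);
$ascii = $m[0];

// UTF-8 mode: the same pattern with the `u` modifier added.
preg_match_all("/\b[\w'-]+\b/u", $sample, $m);
$utf8 = $m[0];

print_r($ascii);
print_r($utf8);
```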
After I gave up trying to do regex voodoo (anyone who can has my deepest respect), I came up with this less-than-elegant hack.
$text = "Testing text. Café is spelled true. And pokémon too... ‘bad quotes’. (brackets)... Löwen, Bären, Vögel und Käfer sind Tiere. That’s what I said.";
$text = str_replace(array('’',"'"), '000AP000', $text);
$text = str_replace("-", '000HY000', $text);
$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$text = str_replace('000AP000', "'", $text);
$text = str_replace('000HY000', "-", $text);
$text = str_replace(array("' ", '- ', '  ', " '", ' -', '  '), ' ', $text);
$words = mb_split( ' +', $text );
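It uses two statistically unlikely strings as placeholders, cleans up the rest of the string, puts the hyphens and apostrophes back, and then removes anything that touches a space (as well as runs of multiple spaces). It works on everything I could find.
I'd like to find a less cumbersome solution if I can, but my regex skills may not be up to the job (even with a cheat sheet).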
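One possibly less cumbersome direction, offered as a sketch rather than a drop-in replacement: match the words directly with Unicode property classes instead of cleaning the string first. The pattern below is an assumption of mine and appears in neither answer. It accepts runs of letters and digits, optionally joined by a single `'`, `’` or `-`, so a quote or hyphen that touches a space never becomes part of a word.

```php
<?php
$text = "Testing text. Café is spelled true. ‘bad quotes’. (brackets)... Löwen, Bären. That’s what I said.";

// \p{L} = any letter, \p{N} = any digit; the `u` modifier makes PCRE read UTF-8.
// An apostrophe or hyphen is kept only when it sits between two letter/digit runs.
$pattern = "/[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/u";

preg_match_all($pattern, $text, $matches);
$words = $matches[0];

// mb_strlen() counts characters rather than bytes, so "Café" has length 4.
$lengths = array_map(function ($w) { return mb_strlen($w, 'UTF-8'); }, $words);

print_r($words);
print_r($lengths);
```

Unlike `\w`, this pattern does not treat the underscore as a word character. Add `_` to both character classes if that matters for your text.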