Question

I have a list of phrases (at most 2 words each), such as:

$words = array('barack obama', 'chicago', 'united states');

Then I have a string like:

$sentence = "Barack Obama is from Chicago. Barack Obama's favorite food it pizza.";

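I want to find or create an efficient algorithm that returns how many times each phrase in the array $words occurs in the string $sentence. In this case, that would be: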

'barack obama' => 2
'chicago' => 1

我该如何构建它？

Answer 1

Read the documentation for substr_count and use it in a loop over $words.

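 $res = array();
 $haystack = strtolower($sentence); // substr_count() is case-sensitive
 foreach ($words as $word) {
    // Count non-overlapping occurrences of the lowercased phrase.
    $res[$word] = substr_count($haystack, strtolower($word));
 }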

Answer 2

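In natural language processing this is called entity extraction. It may look simple in your example, but it can get very complicated. If you intend to use it seriously, you should consider a toolkit such as NLTK, OpenNLP, or Lucene.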

Answer 3

Something like this would do it:

$res = array();
foreach ($words as $word) {
  // preg_quote() escapes regex metacharacters; the /i flag ignores case.
  $res[$word] = preg_match_all('/' . preg_quote($word, '/') . '/i', $sentence);
}

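Note: because this uses a regular expression, you have to make sure your words contain no regex metacharacters, or escape them (for example with preg_quote). Also, a strpos-based solution might perform better, so the right choice depends on how many sentences you need to analyze and how many words are involved.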

Using @Ofri's solution:

$res = array();
$haystack = strtolower($sentence); // substr_count() is case-sensitive
foreach ($words as $word) {
  $res[$word] = substr_count($haystack, strtolower($word));
}
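
The strpos-style approach mentioned in the note can also be written out by hand. The following is only a minimal sketch: count_phrase is a hypothetical helper name, not something from the answers, and it assumes plain ASCII input where byte-wise, case-insensitive matching with stripos is good enough.

```php
<?php
// Hypothetical helper (not from the answers): counts non-overlapping,
// case-insensitive occurrences of $needle in $haystack using stripos().
function count_phrase(string $haystack, string $needle): int
{
    if ($needle === '') {
        return 0; // an empty needle would otherwise never advance the offset
    }
    $count = 0;
    $offset = 0;
    while (($pos = stripos($haystack, $needle, $offset)) !== false) {
        $count++;
        $offset = $pos + strlen($needle); // resume the search after this match
    }
    return $count;
}

$words = array('barack obama', 'chicago', 'united states');
$sentence = "Barack Obama is from Chicago. Barack Obama's favorite food it pizza.";

$res = array();
foreach ($words as $word) {
    $res[$word] = count_phrase($sentence, $word);
}
print_r($res);
```

Whether this is actually faster than the regex version depends on the input sizes, so it is worth measuring on real data before choosing.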

Answer 4

Here is another regex implementation:

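$words = array('barack obama', 'chicago', 'united states');
$sentence = "Barack Obama is from Chicago. Barack Obama's favorite food it pizza. He is president of the United States";
// Build one case-insensitive alternation; preg_quote() escapes regex metacharacters.
$quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
$re = sprintf('/(%s)/i', implode('|', $quoted));
if (preg_match_all($re, $sentence, $m)) {
    // Lowercase the matches so the keys line up with the entries in $words.
    print_r(array_count_values(array_map('strtolower', $m[0])));
}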

Easy to extend: just update $words and $sentence.
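
One thing to keep in mind with the single-pattern approach is that array_count_values only reports phrases that actually matched, so a phrase with no occurrences is simply missing from the result. A possible way to report zeros as well is sketched below. It assumes the entries in $words are lowercase ASCII, and the variable names $quoted and $counts are illustrative only.

```php
<?php
$words = array('barack obama', 'chicago', 'united states');
$sentence = "Barack Obama is from Chicago. Barack Obama's favorite food it pizza.";

// Escape each phrase so it is matched literally inside the pattern.
$quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
$re = '/(' . implode('|', $quoted) . ')/i';

// Start every phrase at 0 so phrases without a match still show up.
$counts = array_fill_keys($words, 0);

if (preg_match_all($re, $sentence, $m)) {
    foreach ($m[0] as $match) {
        // Assumption: keys in $words are lowercase, so lowercase the match.
        $counts[strtolower($match)]++;
    }
}
print_r($counts);
```

With the question's sentence, 'united states' stays in the result with a count of 0 instead of being dropped.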

Help with an algorithm to determine phrase occurrences in a string in PHP

4 answers: