Question

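I have a large number of keywords (over a thousand), and I need to search a large HTML file to find which of those keywords appear in the text. I then need to return the indexes of the keywords that were found.

For example, if my array is: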

$keywords = array("love", "money", "minute", "loop"); // etc.

and the text contains any instance of the words "money" and "loop", I want to get the following array:

$results = array("1", "3"); // first $keyword element is 0

I tried using preg_match_all, but I'm not sure how to get $matches to return the keyword indexes.

Here is my code so far:

$keywords = array("love", "money", "minute", "loop");

$html = file_get_contents($url);

preg_match_all("#(love|money|minute|loop)#i", $html, $matches);

var_dump($matches);

The result looks like this:

array(2) {
  [0]=>
  array(4) {
    [0]=>
    string(6) "minute"
    [1]=>
    string(6) "minute"
    [2]=>
    string(5) "money"
    [3]=>
    string(5) "Money"
  }
  [1]=>
  array(4) {
    [0]=>
    string(6) "minute"
    [1]=>
    string(6) "minute"
    [2]=>
    string(5) "money"
    [3]=>
    string(5) "Money"
  }
}

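What is the fastest/most optimized way to do this in PHP? Is preg_match_all OK? I want to avoid a foreach that would make my function scan the whole HTML more than a thousand times (not very efficient).
How do I get the indexes of my keywords? For example, the keywords found are 0 and 3, regardless of how many times each one occurs.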

Answer 1

You can get the offsets of the matches with the PREG_OFFSET_CAPTURE flag:

$matches=[];
$html = "love and money make the world loop around in a loop three times per minute";
preg_match_all("#love|money|minute|loop#i", $html, $matches, PREG_OFFSET_CAPTURE);
foreach ($matches[0] as $m) echo $m[0]." found at index ".$m[1]."\n";

// output:
love found at index 0
money found at index 9
loop found at index 30
loop found at index 47
minute found at index 68

Whether this runs fast enough is for you to evaluate. If it does, there is no need to look for more complicated alternatives.
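
Note that the offsets above are positions in the text, not positions in $keywords. If the keyword indexes from the question are what is needed, one possible sketch is to build the pattern from the array and map each match back through a lookup table. The function name findKeywordIndexes and the \b word boundaries are assumptions made for illustration; they are not part of the answer above, and whether one large alternation is fast enough for a thousand keywords still has to be measured.

```php
<?php
// Hypothetical helper: returns the indexes of $keywords that occur in $text.
function findKeywordIndexes(array $keywords, string $text): array {
    if ($keywords === []) {
        return [];
    }

    // Lowercased keyword => index lookup table
    $lookup = array_flip(array_map('strtolower', $keywords));

    // Escape each keyword and join them into a single alternation
    $quoted = array_map(function ($k) { return preg_quote($k, '#'); }, $keywords);
    $pattern = '#\b(?:' . implode('|', $quoted) . ')\b#i';

    // One pass over the text for all keywords
    preg_match_all($pattern, $text, $matches);

    // Collect each matched keyword's index once, however often it occurs
    $found = [];
    foreach ($matches[0] as $word) {
        $found[$lookup[strtolower($word)]] = true;
    }

    $indexes = array_keys($found);
    sort($indexes);
    return $indexes;
}

$keywords = array("love", "money", "minute", "loop");
$text = "Money makes the world loop around in a loop";
$result = findKeywordIndexes($keywords, $text);
print_r($result); // indexes 1 and 3
```

The foreach here runs over the matches only, so the text itself is still scanned a single time.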

Answer 2

$keywords = array("love", "money", "minute", "loop");

// The function "GetHtmlWords" gets the html content and cleans it of special
// characters
$htmlWordsArray = explode(' ', GetHtmlWords($url));

// Calculate the intersection - intersect return values while preserving keys
// use array_keys to get just the keys. double check if first index is 0 or 1
$result = array_keys(array_intersect($keywords, $htmlWordsArray));

var_dump($result);

// Get the content of the html, cleaned of special characters, with a single
// space between words
function GetHtmlWords($url) {
  $htmlContent = file_get_contents($url);

  // Handle , and . that may join two words without a space,
  // for example hi.there first,second
  $html = str_replace([".", ","], " ", $htmlContent);

  // Remove the remaining special characters
  $cleanHtml = preg_replace('/[^A-Za-z0-9\- ]/', '', $html);

  // Collapse runs of spaces into a single space
  $htmlWordsOnly = preg_replace('/ {2,}/', ' ', $cleanHtml);

  return $htmlWordsOnly;
}
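
A variation on the same idea, offered as a sketch rather than as this answer's own method: strip the markup with strip_tags(), split the text into lowercase words once, and test each keyword with isset() against a word set. The helper name keywordIndexesInHtml is hypothetical, and the sketch assumes the keywords are single ASCII words.

```php
<?php
// Hypothetical helper: indexes of the keywords that appear as whole words in $html.
function keywordIndexesInHtml(array $keywords, string $html): array {
    // Drop the tags, then lowercase so the comparison ignores case
    $text = strtolower(strip_tags($html));

    // Split on anything that is not a letter, digit or hyphen
    $words = preg_split('/[^a-z0-9\-]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // word => position map, used as a set for fast lookups
    $wordSet = array_flip($words);

    $indexes = [];
    foreach ($keywords as $i => $keyword) {
        if (isset($wordSet[strtolower($keyword)])) {
            $indexes[] = $i;
        }
    }
    return $indexes;
}

$keywords = array("love", "money", "minute", "loop");
$html = "<p>Money makes the <b>loop</b> go round.</p>";
$result = keywordIndexesInHtml($keywords, $html);
print_r($result); // indexes 1 and 3
```

The loop iterates over the keywords, not over the HTML, so the document is processed once. One caveat: strip_tags() removes tags without inserting spaces, so text from adjacent elements can run together.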

Answer 3

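Just an alternative using str_word_count(), which you don't see very often. Passing 2 as the second parameter splits the string into an array of words keyed by each word's starting position. You can then use array_intersect() to match it against the keywords...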

$keywords = array("love", "money", "minute", "loop");
// string courtesy of Joni's answer
$html = "love and money make the world loop around in a loop three times per minute";
$words = str_word_count($html, 2);
$match = array_intersect($words, $keywords);
print_r($match);

gives...

Array
(
    [0] => love
    [9] => money
    [30] => loop
    [47] => loop
    [68] => minute
)

I'm not sure how this performs compared with a regex; it's just something to try.

Or, if you're short of screen space...

print_r(array_intersect(str_word_count($html, 2), $keywords));

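If you only want to know whether the keywords are present, just reverse the order of the arrays in array_intersect() (and, to make it case-insensitive, convert the text to lowercase with strtolower() first)...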

$match = array_intersect($keywords, str_word_count(strtolower($html), 1));

which gives...

Array
(
    [0] => love
    [1] => money
    [2] => minute
    [3] => loop
)
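
Because array_intersect() preserves the keys of its first argument, the keyword indexes asked for in the question can be read off with array_keys(). A short sketch, using a shortened sample string chosen for illustration so that only some keywords match:

```php
<?php
$keywords = array("love", "money", "minute", "loop");
$html = "Money makes the world loop around in a loop";

// Words of the text, lowercased so the comparison ignores case
$words = str_word_count(strtolower($html), 1);

// Keys of the first array survive the intersection, so they are keyword indexes
$indexes = array_keys(array_intersect($keywords, $words));
print_r($indexes); // indexes 1 and 3
```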

PHP - Fastest way to find multiple keywords in a text?
