Question

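I need to create a simple text-file-based search engine (using PHP) as soon as possible! Basically, it has to read the files in a directory, remove stop words and other useless words, and index each remaining useful word along with the number of times it appears in each document.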

I'm guessing the pseudocode is:

for each file in directory:
    read in contents,
    compare to stop words,
    add each remaining word to array,
    count how many times that word appears in document,
    add that number to the array,
    add the id/name of the file to the array,

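I'll also need the total number of words in each file (after the useless ones are removed), which I guess I can work out afterwards, as long as I can get the file ID from that array and then count the words in it...?

Can anyone help, perhaps by providing a bare-bones structure? I think the main part I need help with is getting the number of times each word appears in a document and adding it to the index array...

Thanks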

Answer 1

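Take a look at str_word_count. It counts words, but it can also extract them into an array (each value in the array is one word). You can then post-process this array to remove stop words, count occurrences, and so on.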
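
As a rough sketch of that post-processing, assuming a tiny made-up stop-word list and a sample string in place of a real file, the three steps could be chained like this. The variable names ($stopWords, $useful, $counts) are only illustrative.

```php
<?php
// Hypothetical stop-word list; a real one would be much longer.
$stopWords = array('the', 'a', 'is', 'at', 'and', 'of');

// Sample text standing in for the contents of one file.
$text = 'The plane is at the gate and the plane is late';

// Format 1 makes str_word_count return the words instead of a number.
$words = str_word_count(strtolower($text), 1);

// Keep only the values that do not appear in $stopWords.
$useful = array_diff($words, $stopWords);

// Map each remaining word to the number of times it occurs.
$counts = array_count_values($useful);

// Total number of useful words in the document.
$total = count($useful);

print_r($counts);
echo "Total useful words: $total\n";
```

Lower-casing before the split is a design choice here so that "The" and "the" are treated as the same word; drop the strtolower call if case matters.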

Answer 2

Using glob to find each file in the directory should be simple. You can then read each file with file_get_contents.

/**
 * This is how you will add extra rows
 * 
 * $index[] = array(
 *  'filename' => 'airlines.txt',
 *  'word' => 'JFK',
 *  'count' => 3,
 *  'all_words_count' => 42
 * );
*/
$index = array();

$words = array('jfk', 'car');

foreach( $words as $word ) {

  // All files with a .txt extension
  // Alternate way would be "/path/to/dir/*"
  foreach (glob("test_files/*.txt") as $filename) {

    // Read the whole file into a string; glob already returns a usable
    // path, so there is no need to search the include_path
    $content = file_get_contents($filename);

    $count = 0;

    $totalCount = str_word_count($content);

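    // Case-insensitive match; note that this also matches $word inside
    // longer words. preg_quote escapes any regex metacharacters.
    if( preg_match_all('/' . preg_quote($word, '/') . '/i', $content, $matches) ) {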
      $count = count($matches[0]);
    }

    // Add another item to the list
    $index[] = array(
        'filename' => $filename,
        'word' => $word,
        'count' => $count,
        'all_words_count' => $totalCount
      );

  }

}

// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<pre>';
print_r($index);
echo '</pre>';

When I tested the code above, this is what I got.

Array
(
    [0] => Array
        (
            [filename] => test_files/airlines.txt
            [word] => jfk
            [count] => 2
            [all_words_count] => 38
        )

    [1] => Array
        (
            [filename] => test_files/rentals.txt
            [word] => jfk
            [count] => 0
            [all_words_count] => 47
        )

    [2] => Array
        (
            [filename] => test_files/airlines.txt
            [word] => car
            [count] => 0
            [all_words_count] => 38
        )

    [3] => Array
        (
            [filename] => test_files/rentals.txt
            [word] => car
            [count] => 3
            [all_words_count] => 47
        )

)

I think I've worked out your problem :D Add this after the script above and you should be able to sort by count: lowest first in $sorted, highest first in $sorted_desc.

function sorter($a, $b) {
  if( $a['count'] == $b['count'] )
    return 0;

  return ($a['count'] < $b['count']) ? -1 : 1;
}

// Clone the original list
$sorted = $index;

// Run a custom sort function
uasort($sorted, 'sorter');

// Reverse the array to find the highest first
$sorted_desc = array_reverse($sorted);

// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<h1>Ascending</h1><pre>';
print_r($sorted);
echo '</pre>';

echo '<h1>Descending</h1><pre>';
print_r($sorted_desc);
echo '</pre>';
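
One thing to watch with regex-based counting like this: a bare pattern built from the search word matches substrings, so "car" is also found inside "cargo" or "scarf". If whole-word counts are wanted, a possible variant is sketched below. count_whole_word is a hypothetical helper name, and the sample sentence is made up.

```php
<?php
// Count whole-word, case-insensitive occurrences of $word in $content.
// count_whole_word is a hypothetical helper, not part of the script above.
function count_whole_word($word, $content) {
    // preg_quote escapes regex metacharacters and the '/' delimiter;
    // \b anchors the match at word boundaries.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    // preg_match_all returns the number of full matches, or false on error.
    $n = preg_match_all($pattern, $content, $matches);
    return $n === false ? 0 : $n;
}

$content = 'Car rentals: rent a car, not a cargo van or a scarf.';

// Whole words only.
echo count_whole_word('car', $content), "\n";

// Plain substring matching, for comparison.
echo preg_match_all('/car/i', $content, $m), "\n";
```

On the sample sentence the whole-word version finds fewer matches than the substring version, because "cargo" and "scarf" no longer count.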

Answer 3

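// Stop words to remove (example values; supply your own list)
$useless=array('the','a','and');
$words=array();
foreach (glob('*') as $file) {
    // glob('*') also returns directories; skip anything that is not a file
    if (!is_file($file))
        continue;
    $contents=file_get_contents($file);
    $words[$file]=array();
    // Every run of non-whitespace characters is treated as a word
    preg_match_all('/\S+/',$contents,$matches,PREG_SET_ORDER);
    foreach ($matches as $match) {
        if (!isset($words[$file][$match[0]]))
            $words[$file][$match[0]]=0;
        $words[$file][$match[0]]++;
    }
    // Remove the stop words from this file's counts
    foreach ($useless as $value)
        if (isset($words[$file][$value]))
            unset($words[$file][$value]);
    // count() gives the number of distinct words left;
    // use array_sum($words[$file]) for the total number of occurrences
    $count=count($words[$file]);
    var_dump($words[$file]);
    echo 'Number of words: '.$count;
}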
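
Splitting on whitespace alone keeps case and punctuation, so "Fly", "fly" and "fly," end up under three different keys. A minimal sketch of one way to normalise the tokens first is shown below. It assumes plain ASCII text, and the character class is just one reasonable choice, not the only one.

```php
<?php
// Sample text standing in for one file's contents.
$contents = "Fly JFK. Fly home, fly often!";

// Lower-case first, then match runs of letters, digits and apostrophes,
// so that "Fly", "fly" and "fly," all land on the same key.
preg_match_all("/[a-z0-9']+/", strtolower($contents), $matches);

// Map each token to its number of occurrences.
$counts = array_count_values($matches[0]);

$unique = count($counts);      // number of distinct words
$total  = array_sum($counts);  // number of words overall

print_r($counts);
echo "Distinct: $unique, total: $total\n";
```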

Answer 4

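Here's a basic structure:

Create an $index array
Use scandir (or glob, if you only need files of a certain type) to get the files in the directory.
For each file:
1. Read its contents with file_get_contents
2. Use str_word_count to get a stream of words, $word_stream
3. Create an array $word_array to hold the word counts
4. For each word in $word_stream:
  1. If it's in the $ignored_words array, skip it
  2. If it's not yet a key in $word_array, add it with $word_array[$word] = 1
  3. If it's already in $word_array, increment it with $word_array[$word]++
5. Get the total word count with array_sum on $word_array, or the number of unique words with count; if you like, you can add these to $word_array under the keys "_count" and "_unique" (which won't be words)
6. Add the file name as a key in the $index array, with $word_array as its value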
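
The steps above could be sketched roughly as follows. build_index is a hypothetical function name, the *.txt filter is an assumption about how the documents are stored, and lower-casing the text is an extra choice that the outline does not mention.

```php
<?php
// build_index is a hypothetical name; $dir and $ignored_words come from the caller.
function build_index($dir, array $ignored_words) {
    $index = array();

    // Flip the list so stop-word lookups become isset() checks on keys.
    $ignored = array_flip(array_map('strtolower', $ignored_words));

    // Assumes the documents are *.txt files directly inside $dir.
    $files = glob(rtrim($dir, '/') . '/*.txt');
    if ($files === false) {
        $files = array(); // glob reports errors by returning false
    }

    foreach ($files as $path) {
        $contents = file_get_contents($path);
        if ($contents === false) {
            continue; // unreadable file: skip it
        }

        // Format 1 returns the words themselves rather than a count.
        $word_stream = str_word_count(strtolower($contents), 1);

        $word_array = array();
        foreach ($word_stream as $word) {
            if (isset($ignored[$word])) {
                continue; // stop word
            }
            if (!isset($word_array[$word])) {
                $word_array[$word] = 1;
            } else {
                $word_array[$word]++;
            }
        }

        // Compute both totals before the extra keys are added.
        $total = array_sum($word_array);
        $unique = count($word_array);
        $word_array['_count'] = $total;
        $word_array['_unique'] = $unique;

        // The file name (without the directory) is the key in the index.
        $index[basename($path)] = $word_array;
    }

    return $index;
}
```

A call such as build_index('test_files', array('the', 'and')) would then return one entry per file, each holding the per-word counts plus the two summary keys. The totals are computed before "_count" and "_unique" are stored so that the summary entries do not inflate their own values.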

Creating a simple text-file-based search engine
