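I need to quickly build a simple search engine over text files (using PHP)! Basically it has to read the files in a directory, remove stop words and other useless words, and index each remaining useful word along with the number of times it appears in each document.
I'm guessing the pseudocode is:
for each file in directory: read in contents, compare to stop words, add each remaining word to array, count how many times that word appears in document, add that number to the array, add the id/name of the file to the array
I also need to work out the total number of words in each file (after the useless ones are removed), though I guess I can do that afterwards, as long as I can get the file ID from that array and then count the words in it...?
Can anyone help, perhaps by providing a barebones structure? I think the main part I need help with is getting the number of times each word appears in a document and adding it to the index array...
Thanks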
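Answer 0 (score: 1)
Have a look at str_word_count. It counts words, but it can also extract them into an array (each value in the array is one word). You can then post-process this array to remove stop words, count occurrences, and so on.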
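As a rough illustration of that post-processing, here is a minimal sketch. The function name count_useful_words and the sample stop-word list are made up for this example and do not come from the answer; lowercasing the text first is also an assumption, so that "The" and "the" count as the same word.

```php
<?php
// Hypothetical helper (not part of the answer): count the useful words in a string.
function count_useful_words($text, array $stopWords)
{
    // With format 1, str_word_count returns the words instead of a number.
    $words = str_word_count(strtolower($text), 1);

    // Remove every word that appears in the stop-word list.
    $useful = array_diff($words, $stopWords);

    // Map each remaining word to the number of times it occurs.
    return array_count_values($useful);
}

$stopWords = array('the', 'a', 'to', 'and'); // sample list, an assumption
$counts = count_useful_words('The car drove to the airport and the car park', $stopWords);
print_r($counts);

// Total number of useful words in the text.
echo array_sum($counts), "\n";
```

Note that str_word_count only treats letters (plus embedded apostrophes and hyphens) as word characters by default, so digits are dropped unless you pass them in its third argument.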
Answer 1 (score: 1)
Finding every file in the directory should be simple using glob.
Each file can then be read using file_get_contents.
/**
 * Each row added to the index has this shape:
 *
 * $index[] = array(
 *     'filename' => 'airlines.txt',
 *     'word' => 'JFK',
 *     'count' => 3,
 *     'all_words_count' => 42
 * );
 */
$index = array();
$words = array('jfk', 'car');
foreach ($words as $word) {
    // All files with a .txt extension
    // Alternate way would be "/path/to/dir/*"
    foreach (glob("test_files/*.txt") as $filename) {
        // Read the whole file into a string
        $content = file_get_contents($filename);
        $count = 0;
        // Total number of words in the file (stop words included)
        $totalCount = str_word_count($content);
        // Case-insensitive match; note that this also counts the word
        // when it occurs inside a longer word
        if (preg_match_all('/' . preg_quote($word, '/') . '/i', $content, $matches)) {
            $count = count($matches[0]);
        }
        // Add another item to the list
        $index[] = array(
            'filename' => $filename,
            'word' => $word,
            'count' => $count,
            'all_words_count' => $totalCount
        );
    }
}
// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<pre>';
print_r($index);
echo '</pre>';
When I tested the code above, this is what I got.
Array
(
[0] => Array
(
[filename] => test_files/airlines.txt
[word] => jfk
[count] => 2
[all_words_count] => 38
)
[1] => Array
(
[filename] => test_files/rentals.txt
[word] => jfk
[count] => 0
[all_words_count] => 47
)
[2] => Array
(
[filename] => test_files/airlines.txt
[word] => car
[count] => 0
[all_words_count] => 38
)
[3] => Array
(
[filename] => test_files/rentals.txt
[word] => car
[count] => 3
[all_words_count] => 47
)
)
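I think I've solved your problem :D Add this after the script above and you should be able to sort by count: $sorted starts from the lowest count and $sorted_desc starts from the highest.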
function sorter($a, $b) {
    if ($a['count'] == $b['count'])
        return 0;
    return ($a['count'] < $b['count']) ? -1 : 1;
}
// Clone the original list
$sorted = $index;
// Run a custom sort function
uasort($sorted, 'sorter');
// Reverse the array to find the highest first
$sorted_desc = array_reverse($sorted);
// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<h1>Ascending</h1><pre>';
print_r($sorted);
echo '</pre>';
echo '<h1>Descending</h1><pre>';
print_r($sorted_desc);
echo '</pre>';
Answer 2 (score: 1)
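$useless=array('the','and','of'); // placeholder stop-word list, replace with your own
$words=array();
foreach (glob('*') as $file) {
    if (!is_file($file)) continue; // glob('*') also returns directories
    $contents=file_get_contents($file);
    $words[$file]=array();
    // Split the contents into whitespace-separated tokens
    preg_match_all('/\S+/',$contents,$matches,PREG_SET_ORDER);
    foreach ($matches as $match) {
        if (!isset($words[$file][$match[0]]))
            $words[$file][$match[0]]=0;
        $words[$file][$match[0]]++;
    }
    // Drop the stop words
    foreach ($useless as $value)
        if (isset($words[$file][$value]))
            unset($words[$file][$value]);
    // Total occurrences of the remaining words
    $count=array_sum($words[$file]);
    var_dump($words[$file]);
    echo 'Number of words: '.$count;
}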
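Answer 3 (score: 0)
Here is a basic structure:
1. Create an $index array.
2. Use scandir (or glob) to get the files in the directory.
3. For each file:
   a. Read its contents with file_get_contents.
   b. Use str_word_count to get a stream of words, $word_stream.
   c. Create an array, $word_array, to hold the word counts.
   d. For each word in $word_stream:
      - If it is in the $ignored_words array, skip it.
      - If it is not yet a key in $word_array, add it with $word_array[$word] = 1.
      - If it is already in $word_array, increment it with $word_array[$word]++.
   e. Get the total of $word_array with array_sum, or the number of unique words with count; if you like, you can add these to $word_array under the keys "_count" and "_unique" (which cannot be words).
   f. Add the file as a key of the $index array, with $word_array as its value.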
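The steps above can be sketched roughly as follows. This is only one possible reading of the outline: the function name build_index, its parameters, the *.txt pattern, the lowercasing, and the use of the base file name as the key are all assumptions made for illustration.

```php
<?php
// Hypothetical function (name and parameters are assumptions): index every
// .txt file in $dir, skipping the words listed in $ignored_words.
function build_index($dir, array $ignored_words)
{
    $index = array();

    // glob returns the matching paths; treat a failure as "no files".
    $files = glob($dir . '/*.txt');
    if ($files === false) {
        $files = array();
    }

    foreach ($files as $path) {
        $contents = file_get_contents($path);
        if ($contents === false) {
            continue; // unreadable file, skip it
        }

        // Format 1 returns the list of words found in the string.
        $word_stream = str_word_count(strtolower($contents), 1);

        $word_array = array();
        foreach ($word_stream as $word) {
            if (in_array($word, $ignored_words, true)) {
                continue;
            }
            if (!isset($word_array[$word])) {
                $word_array[$word] = 1;
            } else {
                $word_array[$word]++;
            }
        }

        // Compute both totals before adding the summary keys, so the
        // summary entries are not counted themselves.
        $total = array_sum($word_array);
        $unique = count($word_array);
        $word_array['_count'] = $total;
        $word_array['_unique'] = $unique;

        $index[basename($path)] = $word_array;
    }

    return $index;
}

// Example call; 'test_files' and the ignored words are placeholders.
print_r(build_index('test_files', array('the', 'and', 'of')));
```

Keeping the per-file counts in a nested array keyed by file name means a lookup such as $index['airlines.txt']['jfk'] gives the count for one word in one file, and $index['airlines.txt']['_count'] gives that file's total.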