Question

=================

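1. Find files older than X minutes.

2. Process them from oldest to newest.

The code below works fine, but the directory contains 3 million files, so I need to optimize it to find the files faster. I don't care about the contents of the files, only their names.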

###########################
sub get_files_to_process{
###########################
# Declare arrays
my @xmlfiles;
my @qulfiedfiles; 

# Declare a Dictionary
my %filedisc;

opendir(my $dh, $maindir) or die "opendir($maindir): $!";

 # Read all the files
 while (my $de = readdir($dh)) {
    # Get the full path of the file ($maindir is assumed to end with a slash)
    my $f = $maindir . $de;
    # If it is a plain file whose name ends in .xml
    if ( -f $f && $f =~ /\.xml$/ ) {
       # Put it in a XMLFILES Array
       push (@xmlfiles, $f); }
    }
    closedir($dh);


 # For every file in directory
 for my $file (@xmlfiles) {

    # Get stats about a file
    my @stats = stat($file);

    # If time stamp is older than minutes provided
    if ($stats[9] <= ($now - (( $minutesold * 60) ))){

       # Put the File and Time stamp in the dictionary
       $filedisc{$file} = $stats[9];
    }
 }

 # Sort the dictionary keys by timestamp, oldest files first
 for my $x (sort { $filedisc{$a} <=> $filedisc{$b} or $a cmp $b } keys %filedisc) {

    # Put the qualified files (based on age) in a list
    push(@qulfiedfiles, $x);
 }

 return @qulfiedfiles;
}

Update: this looks promising so far, though more testing is needed:

##########################
sub get_files_count{
##########################

   my $cmd= "find $maindir -maxdepth 1 -name '*.xml' -mmin +$minutesold -printf \"%T+\t%p\\n\"| sort";
   my @output = `$cmd`;

   if (@output){
      foreach my $line (@output){
            chomp $line;
            push (@files2process, ( split '\t', $line )[ -1 ]);
         }
      }
   }

Answer 1

Use File::Find:

use File::Find;

$\ = "\n";   # append a newline to every print

my @files;

# find all files newer than 9 minutes
File::Find::find({wanted => \&wanted}, '.');

# sort them (largest -M, i.e. oldest, first) and print them
print for map { $_->[0] }  sort { $b->[1] <=> $a->[1] } @files;

exit;

sub wanted {
   # -M is the file's age in days, so 9 minutes is 9 / (24 * 60)
   ((-M) < (9 / (24 * 60))) && -f && push @files, [ $_, ( -M ) ];
}

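This is recursive, so it will walk all subdirectories (though I assume your directory has none).

Also, the above is mostly auto-generated code from find2perl, which converts most Unix find arguments into a Perl script. Cool and fast.

I haven't tested the -M bit with 9 minutes; I haven't saved anything in the last 9 minutes.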
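For the question as asked (files older than X minutes, top-level directory only), the same idea can be adapted. The following is only a sketch under assumptions: the helper name `old_files_flat` is made up for illustration, it relies on `$File::Find::prune` to skip subdirectories, and it has not been measured against a directory with millions of entries.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the plain files directly inside $dir whose
# modification age exceeds $minutes, oldest first.
sub old_files_flat {
    my ($dir, $minutes) = @_;
    my $min_age_days = $minutes / (24 * 60);   # -M reports age in days
    my $seen_top     = 0;
    my @found;

    find({
        no_chdir => 1,
        wanted   => sub {
            my $path = $File::Find::name;
            if (-d $path) {
                # The first directory seen is $dir itself; prune all others
                # so the search stays in the top-level directory.
                $File::Find::prune = 1 if $seen_top++;
                return;
            }
            return unless -f _;                # reuse the stat made by -d
            my $age = -M _;                    # age relative to script start
            push @found, [ $path, $age ] if $age > $min_age_days;
        },
    }, $dir);

    # A larger -M means an older file, so sort descending by age.
    return map { $_->[0] } sort { $b->[1] <=> $a->[1] } @found;
}
```

Calling `old_files_flat('/some/dir', 9)` would then return the full paths of files last modified more than 9 minutes before the script started.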

Answer 2

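I would solve this in two steps:

1) Create a Linux::Inotify2 process that updates a cache file (e.g. with Storable) on every change in the directory.

That is, you keep an up-to-date cache of the stats of all files. Loading one Storable file is faster than gathering stats for 3M files on every run.

2) When you need to search, just load the Storable file and search one big hash...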
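Step 2 might look roughly like the sketch below. It assumes a cache layout that the answer does not specify: a Storable file holding one hash that maps file names to modification times in epoch seconds. The helper name `files_older_than` is hypothetical, and the watcher process from step 1 that keeps the cache current is not shown.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Assumed cache layout: { "file name" => mtime in epoch seconds }.
# Returns the names older than $minutes, oldest first.
sub files_older_than {
    my ($cache_file, $minutes, $now) = @_;
    $now = time unless defined $now;

    my $cache = retrieve($cache_file)
        or die "cannot load cache $cache_file";
    my $cutoff = $now - $minutes * 60;

    # Filter first so that only the qualifying entries get sorted.
    my @old = grep { $cache->{$_} <= $cutoff } keys %$cache;
    return sort { $cache->{$a} <=> $cache->{$b} or $a cmp $b } @old;
}
```

The watcher would write the cache with something like `store(\%mtimes, $cache_file)` whenever the directory changes, so the search never has to stat the files itself.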

Answer 3

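I know this is an old question. I'm mostly answering it for "posterity".

Most of your time is most likely spent sorting the 3 million file entries, because sorting is non-linear (i.e. the more files you have, the slower it gets per file), and also because most of the stat calls happen in the comparisons, which occur mainly because of the sort. (In addition, the file list probably takes up a considerable chunk of memory.)

So if you can avoid sorting, you automatically avoid most of the stat calls and save a lot of time. Since your task is just to "move files into the appropriate directories", I would simply call the processing routine for each qualifying file the moment it is found, instead of first building a huge list, sorting it with a bunch of loops, and then walking the huge list and processing it in a way that doesn't necessarily require sorting at all.

An example from your own script: "find", unlike "ls", does not build a file list in memory; it executes a command for each file as it finds it. That is why it doesn't blow up on huge directories the way "ls" does. Just do it the way find does ^^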
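A minimal sketch of that "process each file as it is found" approach, assuming the oldest-first order can be dropped. The helper name `process_old_files` and the callback interface are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: call $handler->($path) for every *.xml file in $dir
# older than $minutes, as soon as it is found. Nothing is collected or
# sorted, so memory use does not grow with the size of the directory.
sub process_old_files {
    my ($dir, $minutes, $handler) = @_;
    my $cutoff = time - $minutes * 60;
    my $count  = 0;

    opendir(my $dh, $dir) or die "opendir($dir): $!";
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /\.xml\z/;   # cheap name test before any stat
        my $path  = "$dir/$name";
        my @stats = stat $path or next;   # file may have vanished meanwhile
        next unless -f _;                 # reuse the stat result
        next if $stats[9] > $cutoff;      # modified too recently
        $handler->($path);
        $count++;
    }
    closedir $dh;
    return $count;                        # number of files handled
}
```

A call such as `process_old_files($maindir, $minutesold, sub { print "$_[0]\n" })` stats each candidate exactly once and handles it immediately, in whatever order readdir returns the entries.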

Perl: fastest way to find files older than X minutes, sorted oldest to newest?

3 answers: