Question

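This script used to work fine, but now that I am dealing with a large number of inodes (about 400K), it seems to suffer from I/O slowness. The script reads a definition file "def", which is a list of identifiers. For each of the 400K files in the "dir" directory, if one of the identifiers is found in the first 4 lines, the whole file content is appended to the end of a "def"-specific file.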

#!/bin/sh
for def in *.def
do
        touch $def.out
        for file in $dir/*
        do
                if head -4 $file | grep -q -f  $def
                then
                        cat $file >> $def.out
                fi
        done
done

How can I make it faster?

Answer 1

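A Perl solution. It should be much faster than your script because:

It creates a regular expression from each .def file. It does not read each .def file more than once.
It uses opendir to read the directory contents. That is much faster than expanding a * glob, but as a penalty the files are not sorted. To compare the output of your script with mine, you have to use: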
```
diff <(sort $def.out) <(sort $def-new.out)
```
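You can replace opendir with glob to get exactly the same output. That slows the script down, but it is still much faster than the old one.

Here is the script: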

#!/usr/bin/perl
use warnings;
use strict;

my $dir = 'd';              # Enter your dir here.

my @regexen;
my @defs = glob '*.def';
for my $def (@defs) {
    open my $DEF,   '<', $def           or die "$def: $!";
    open my $TOUCH, '>', "$def-new.out" or die "$def-new.out: $!";
    my $regex = q();
    while (<$DEF>) {
        chomp;
        $regex .= "$_|"
    }
    substr $regex, -1, 1, q();
    push @regexen, qr/$regex/;
}

# If you want the same order, uncomment the following 2 lines and comment the next 2 ones.
#
# for my $file (glob "$dir/*") {
#     $file =~ s%.*/%%;

opendir my $DIR, $dir or die "$dir: $!";
while (my $file = readdir $DIR) {
    next unless -f "$dir/$file";

    my %matching_files;
    open my $FH, '<', "$dir/$file" or die "$dir/$file: $!";
    while (my $line = <$FH>) {
        last if $. > 4;
        my @matches = map $line =~ /$_/ ? 1 : 0, @regexen;
        $matching_files{$_}++ for grep $matches[$_], 0 .. $#defs;
    }

    for my $i (keys %matching_files) {
        open my $OUT, '>>', "$defs[$i]-new.out" or die "$defs[$i]-new.out: $!";
        open my $IN,  '<',  "$dir/$file"        or die "$dir/$file: $!";
        print $OUT $_ while <$IN>;
        close $OUT;
    }
}

Update

Files can now be extracted more than once. Instead of building one huge regex, the script builds an array of regexes and matches them one by one.
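
For readers who would rather stay in the shell, the same single-pass idea (load every .def once, keep one pattern list per .def, test the first 4 lines of each file against every list) can be sketched with find and awk. This is not the Perl script above, only a rough analogue under several assumptions: `grab_all` is a hypothetical helper name, the output goes to `NAME.def-new.out` as in the Perl version, the files are text, file names contain no newlines, patterns are treated as awk extended regexes, and `find` supports `-maxdepth`.

```shell
#!/bin/sh
# grab_all DIR (hypothetical helper): read every *.def in the current
# directory once, then scan the first 4 lines of each regular file in DIR and
# append matching files to NAME.def-new.out. A file may match several .def
# files and is then appended to each of them.
grab_all() {
    dir=$1
    set -- *.def
    [ -f "$1" ] || return 0            # no .def files in the current directory
    find "$dir" -maxdepth 1 -type f | awk '
        BEGIN {
            # Operands before the final "-" are .def files: load their patterns.
            for (d = 1; d < ARGC - 1; d++) {
                def[d] = ARGV[d]
                out = def[d] "-new.out"
                printf "" > out; close(out)    # create or truncate the output
                while ((getline pat < def[d]) > 0)
                    if (pat != "") pats[d, ++npat[d]] = pat
                close(def[d])
                ARGV[d] = ""                   # do not read it as main input
            }
            ndef = ARGC - 2
        }
        {
            file = $0                          # one path per line from find
            split("", hit)                     # forget matches of the last file
            for (n = 1; n <= 4 && (getline line < file) > 0; n++)
                for (d = 1; d <= ndef; d++)
                    for (p = 1; p <= npat[d]; p++)
                        if (line ~ pats[d, p]) { hit[d] = 1; break }
            close(file)
            for (d in hit) {                   # copy the file to each match
                out = def[d] "-new.out"
                while ((getline line < file) > 0) print line >> out
                close(file); close(out)
            }
        }
    ' "$@" -
}
```

Called as `grab_all "$dir"`, this starts one find and one awk process in total, instead of several processes per file. Files that do not match are only read up to their fourth line.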

Answer 2

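I have found that once a folder holds more than 10,000 files, I start to see performance problems. When that happens, even the ls command can take several seconds to return.

Your script seems inherently I/O heavy. It looks at a lot of files and creates or appends to a lot of files. Without changing how the script operates, I don't see anything that can be improved.

If you can, move some of this data into a database. A database can be tuned to this scale of data much more easily than a file system.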

Answer 3

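You can save a lot of forks; every fork saved inside the loop adds up to 400K forks saved over the whole script. Here is what I would do.

Instead of touching each *.def one at a time, touch them in one big chunk: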

find . -name '*.def' | sed 's/\(.*\)/\1.out/' | xargs touch

(If your find supports it, use find . -maxdepth 1 ...)

Instead of a two-command pipeline, use just one command:
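
As a possible variation on the same idea, the output files can be created without running touch at all, because an append redirection on the shell's built-in `:` command creates a missing file and leaves an existing one unchanged. A minimal sketch, where `create_outputs` is a hypothetical helper name:

```shell
#!/bin/sh
# create_outputs (hypothetical helper): make sure NAME.def.out exists for
# every *.def in the current directory, without spawning any process.
create_outputs() {
    for def in *.def; do
        [ -f "$def" ] || continue   # no *.def present: the glob stays literal
        : >> "$def.out"             # append-open creates the file, keeps content
    done
}
```

Unlike touch, this does not update the modification time of an output file that already exists.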

if awk "NR <= 4 && /$def/ { exit 0 } NR==5 { exit 1 }" $file; then

(Check that your $def does not contain metacharacters. A dot should be fine.)
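
Note that in the question $def names a file holding one pattern per line (it is passed to grep -f), so interpolating $def into the awk program would match against the file name rather than its contents. A hedged sketch of the one-command test that reads the patterns from the file instead is given below. `head_matches` is a hypothetical helper name; the sketch assumes the patterns are valid awk extended regexes (grep -f uses basic regexes by default) and it skips blank pattern lines.

```shell
#!/bin/sh
# head_matches PATTERN_FILE DATA_FILE (hypothetical helper)
# Exit status 0 if any pattern from PATTERN_FILE matches one of the first
# 4 lines of DATA_FILE, 1 otherwise. Runs a single awk process.
head_matches() {
    awk '
        FILENAME == ARGV[1] {             # first operand: collect the patterns
            if ($0 != "") pats[++n] = $0  # blank lines are skipped (assumption)
            next
        }
        FNR > 4 { exit }                  # only the first 4 lines matter
        {
            for (i = 1; i <= n; i++)
                if ($0 ~ pats[i]) { found = 1; exit }
        }
        END { exit (found ? 0 : 1) }      # report the result as exit status
    ' "$1" "$2"
}
```

In the question's inner loop the condition would then read `if head_matches "$def" "$file"`, which replaces the head and grep pair with one process per file.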

Optimizing a "search and grab" sh script
