Question

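I have a simple find command that has to go through millions of files on a server and pick out the few that have a given suffix. Files are written and deleted frequently over time. I am just wondering whether there is a way to make the find faster. Using locate is out of the question, because building the locate database would be very expensive.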

find /myDirWithThausandsofDirectories/ -name "*.suffix"

On some servers this command takes days!

Any ideas?

Thanks,

Answer 1

Divide and conquer? Assuming a multiprocessing OS and a multi-core processor, spawn several find commands, one per subfolder.

for dir in /myDirWithThausandsofDirectories/*
do find "$dir" -name "*.suffix" &
done

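Depending on the number of subdirectories, you may want to control how many processes (find commands) run at any given time. That is a bit trickier but doable (for example, in the bash shell, keep an array of the PIDs of the spawned processes, taken from $!, and only start new ones depending on the length of the array). Also, the loop above does not search files that sit directly in the root directory; it is just a simple example of the idea.

If you don't know how to do process management, it is time to learn ;) This is a very good text on the subject. This is actually what you need. But read the whole text to understand how it works.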
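
The throttling idea described in this answer could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the function name parallel_find, its argument order, and the default of 4 jobs are made up for illustration, and hidden top-level directories are skipped because * does not match them.

```bash
#!/usr/bin/env bash
# Sketch: one find per top-level subdirectory, at most max_jobs at a time.
# parallel_find and max_jobs are illustrative names, not from the answer.
parallel_find() {
    local root=$1 pattern=$2 max_jobs=${3:-4}
    local dir
    local -a pids=()

    for dir in "$root"/*; do
        [ -d "$dir" ] || continue      # skip plain files and an unmatched glob
        # Throttle: once the PID array is full, wait for the oldest job.
        if (( ${#pids[@]} >= max_jobs )); then
            wait "${pids[0]}"
            pids=("${pids[@]:1}")
        fi
        find "$dir" -name "$pattern" &
        pids+=("$!")
    done

    # Files directly under the root are not covered by the loop above.
    find "$root" -maxdepth 1 -type f -name "$pattern"

    # Wait for whatever is still running.
    if (( ${#pids[@]} > 0 )); then
        wait "${pids[@]}"
    fi
}

# Example call (commented out so that sourcing the file has no side effects):
# parallel_find /myDirWithThausandsofDirectories "*.suffix" 8
```

Waiting on the oldest PID is the simplest way to keep the array bounded; it does not start a new job the moment any job finishes. All jobs share one stdout here, so if the output is large it may be safer to redirect each job to its own file.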

Answer 2

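You can use the audit subsystem to monitor file creation and deletion. Combined with an initial run of find, this lets you build a file database that can be updated in real time.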

Answer 3

Since you are using a simple glob, you can use Bash's recursive globbing. For example:

shopt -s globstar
for path in /etc/**/*.conf
do
    echo "$path"
done

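This may be faster, because it uses built-in shell functionality that is far less flexible than find.

If you cannot use Bash but you have a limit on the path depth, you can list the different depths explicitly: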

for path in /etc/*/*.conf /etc/*/*/*.conf /etc/*/*/*/*.conf
do
    echo "$path"
done
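
One caveat with the explicit-depth loop: in a POSIX shell, a pattern that matches nothing is passed through as the literal string, so the loop body would see something like /etc/*/*/*/*.conf itself. A minimal sketch that guards against this is shown below; list_by_depth is a hypothetical helper name, and the three levels are an arbitrary limit chosen for illustration.

```sh
# Sketch for a POSIX sh without globstar. list_by_depth is an illustrative name.
# Note: POSIX sh has no "local", so root, suffix and path are global variables.
list_by_depth() {
    root=$1
    suffix=$2
    for path in "$root"/*."$suffix" "$root"/*/*."$suffix" "$root"/*/*/*."$suffix"
    do
        # An unmatched pattern is left as-is by the shell; skip it.
        [ -e "$path" ] || [ -L "$path" ] || continue
        printf '%s\n' "$path"
    done
}

# Example call (commented out):
# list_by_depth /etc conf
```

The whole list is expanded by the shell before the loop starts, so on a tree with millions of matches this uses a lot of memory.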

Answer 4

Here is the code:

# List the top-level subdirectories (not the root itself), one per line.
find /myDirWithThausandsofDirectories/ -mindepth 1 -maxdepth 1 -type d > /tmp/input
mapfile -t files < /tmp/input


do_it() {
   for f; do find "$f" -name "*.suffix" | sed -e 's/\.suffix$//'; done
}

# Divide the list into 5 sub-lists.
# Round-robin, so no empty entries reach find when the count is not a multiple of 5.
i=0 a=() b=() c=() d=() e=()
for f in "${files[@]}"; do
    case $((i++ % 5)) in
        0) a+=("$f") ;;
        1) b+=("$f") ;;
        2) c+=("$f") ;;
        3) d+=("$f") ;;
        4) e+=("$f") ;;
    esac
done

# Process the sub-lists in parallel
do_it "${a[@]}" >> /tmp/f.unsorted 2>/tmp/f.err &
do_it "${b[@]}" >> /tmp/f.unsorted 2>/tmp/f.err &
do_it "${c[@]}" >> /tmp/f.unsorted 2>/tmp/f.err &
do_it "${d[@]}" >> /tmp/f.unsorted 2>/tmp/f.err &
do_it "${e[@]}" >> /tmp/f.unsorted 2>/tmp/f.err &
wait
echo Find is Done!

The only problem I have is that a very small percentage of the file names come out truncated. I don't know what the cause could be!
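
A possible explanation, offered as a guess rather than a confirmed diagnosis: all five background jobs append to the same /tmp/f.unsorted, and when its output is a file sed typically writes in buffer-sized blocks rather than whole lines, so a block from one job can end in the middle of a line and be followed by a block from another job. A minimal sketch of one workaround gives each job its own output file and merges the files after wait. The function name search_parallel, its arguments, and the temporary file layout are assumptions made for illustration, not part of the script above.

```bash
#!/usr/bin/env bash
# Sketch: each background job writes to its own file; merge only after wait.
# search_parallel and the list.N / list.N.out file names are illustrative.
search_parallel() {
    local root=$1 suffix=$2 jobs=${3:-5}
    local tmp dir list n i=0
    tmp=$(mktemp -d) || return 1

    # Round-robin the top-level directories into $jobs NUL-separated lists.
    for dir in "$root"/*; do
        [ -d "$dir" ] || continue
        n=$((i % jobs))
        i=$((i + 1))
        printf '%s\0' "$dir" >> "$tmp/list.$n"
    done

    # One background job per list, each with a private output file.
    for list in "$tmp"/list.*; do
        [ -e "$list" ] || continue
        while IFS= read -r -d '' dir; do
            find "$dir" -name "*.$suffix"
        done < "$list" > "$list.out" &
    done
    wait

    # Merge sequentially and strip the suffix, as the original script does.
    cat "$tmp"/*.out 2>/dev/null | sed "s/\.$suffix\$//"
    rm -rf "$tmp"
}

# Example call (commented out):
# search_parallel /myDirWithThausandsofDirectories suffix 5 > /tmp/f.unsorted
```

Because the merge happens in a single process after all jobs have finished, no two writers ever touch the same file at the same time.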
