Fastest way in .NET to find all files matching a pattern across all directories

Date: 2014-01-22 15:20:36

Tags: vb.net

I have roughly 500K *.ax5 files that I have to process and export to another format. Because of the sheer number of files, and because Windows has performance problems with too many files in one folder, they are tucked away in subfolders alongside other files with different extensions. What is the fastest way in C# to find every one of these files contained in subfolders at any level under C:\Sketch?

After the initial run, the folder structure is always the same, AAAA\BB\CCCC_BLD\[a bunch of different file types]. I would also like to process only files whose write date is later than the date of the last run.

Alternatively, how can I quickly get a record count so that I can display the percentage processed?

I cannot change the source structure of the files/folders, which is set by the vendor.

This is what I have so far. I have tried both Array.ForEach and Parallel.ForEach; both seem very slow.

Sub walkTree(ByVal directory As DirectoryInfo, ByVal pattern As String)
    Array.ForEach(directory.EnumerateFiles(pattern).ToArray(), Sub(fileInfo)
                                                                   Export(fileInfo)
                                                               End Sub)
    For Each subDir In directory.EnumerateDirectories()
        walkTree(subDir, pattern)    
    Next
End Sub
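For the incremental part of the question, a minimal C# sketch (assuming a hypothetical lastRunUtc timestamp persisted from the previous run, and using the Export step as a placeholder) could filter on LastWriteTimeUtc and materialize the list once to get the record count needed for a progress percentage:

```csharp
using System;
using System.IO;
using System.Linq;

class IncrementalScan
{
    static void Main()
    {
        // Hypothetical timestamp saved at the end of the previous run.
        DateTime lastRunUtc = new DateTime(2014, 1, 1, 0, 0, 0, DateTimeKind.Utc);

        // Walk the whole tree lazily, keeping only files written since the last run.
        var pending = new DirectoryInfo(@"C:\Sketch")
            .EnumerateFiles("*.ax5", SearchOption.AllDirectories)
            .Where(f => f.LastWriteTimeUtc > lastRunUtc)
            .ToList();

        // Materializing once gives the total count needed for a progress display.
        int total = pending.Count;
        for (int i = 0; i < total; i++)
        {
            // Export(pending[i]);   // the question's existing export routine
            Console.WriteLine("{0:P0} done", (i + 1) / (double)total);
        }
    }
}
```

This trades one up-front pass over the tree for an accurate progress figure; if the count is not needed, the Where filter alone avoids re-exporting unchanged files.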

3 Answers:

Answer 0 (score: 7):

http://msdn.microsoft.com/en-us/library/ms143316(v=vs.110).aspx

Directory.GetFiles(@"C:\Sketch", "*.ax5", SearchOption.AllDirectories);

Would that work for you?
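As a hedged variant of the line above: GetFiles buffers the entire result before returning, while its lazy counterpart Directory.EnumerateFiles yields paths as the walk proceeds, so exporting can begin before the scan finishes. A minimal sketch (Export stands in for the question's routine):

```csharp
using System;
using System.IO;

class LazyScan
{
    static void Main()
    {
        // EnumerateFiles streams results instead of building the full array
        // first, so the first export can start almost immediately.
        foreach (string path in Directory.EnumerateFiles(
                     @"C:\Sketch", "*.ax5", SearchOption.AllDirectories))
        {
            // Export(path);  // placeholder for the per-file work
            Console.WriteLine(path);
        }
    }
}
```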


As for performance, I doubt you will find any faster way to scan the directories; as @Mathew Foscarini points out, your disk is the bottleneck here.

If the directory is indexed, then using the index, as @jaccus mentions, would be faster.


I took some time to benchmark this. It does actually seem like you can get about a 33% performance gain by collecting the files asynchronously.

The test set I ran on may not match your situation; I don't know how deeply nested your files are, etc. What I did was create 5,000 random files in each directory at each level (I settled on a single level, though) across 100 directories, totaling 505,000 files...

I tested 3 ways of collecting the files...

The simplest approach:

public class SimpleFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        return new List<string>( Directory.GetFiles(directory.FullName, pattern, SearchOption.AllDirectories));
    }
}

The "dumb" approach, although this is only dumb if you know about the overload used in the Simple approach... Otherwise it is a perfectly good solution:

public class DumbFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        List<string> files = new List<string>(500000);
        files.AddRange(directory.GetFiles(pattern).Select(file => file.FullName));

        foreach (DirectoryInfo dir in directory.GetDirectories())
        {
            files.AddRange(CollectFiles(dir, pattern));
        }
        return files;
    }
}

The Task API approach...

public class ThreadedFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
        InternalCollectFiles(directory, pattern, queue);
        return queue.AsEnumerable().ToList();
    }

    private void InternalCollectFiles(DirectoryInfo directory, string pattern, ConcurrentQueue<string> queue)
    {
        foreach (string result in directory.GetFiles(pattern).Select(file => file.FullName))
        {
            queue.Enqueue(result);
        }

        Task.WaitAll(directory
            .GetDirectories()
            .Select(dir => Task.Factory.StartNew(() => InternalCollectFiles(dir, pattern, queue))).ToArray());
    }
}

Note that this is a test of only collecting the files. If you are going to process them as well, it would make sense to start processing threads while the collection is still running.
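One way to overlap collection and processing is a producer/consumer pair over a BlockingCollection; this is a minimal sketch (Export is a placeholder for the actual per-file work, and a single consumer is shown, though several could share the same loop):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class CollectAndProcess
{
    static void Main()
    {
        var queue = new BlockingCollection<string>();

        // Producer: walk the tree and feed paths into the collection.
        var producer = Task.Run(() =>
        {
            foreach (string f in Directory.EnumerateFiles(
                         @"C:\Sketch", "*.ax5", SearchOption.AllDirectories))
            {
                queue.Add(f);
            }
            queue.CompleteAdding();   // tell consumers no more items will arrive
        });

        // Consumer: process files as soon as they are discovered.
        var consumer = Task.Run(() =>
        {
            foreach (string f in queue.GetConsumingEnumerable())
            {
                // Export(f);  // placeholder for the actual export step
                Console.WriteLine(f);
            }
        });

        Task.WaitAll(producer, consumer);
    }
}
```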

Here are the results on my system:

Simple Collector:
 - Pass 0: found 505000 files in 2847 ms
 - Pass 1: found 505000 files in 2865 ms
 - Pass 2: found 505000 files in 2860 ms
 - Pass 3: found 505000 files in 3061 ms
 - Pass 4: found 505000 files in 3006 ms
 - Pass 5: found 505000 files in 2807 ms
 - Pass 6: found 505000 files in 2849 ms
 - Pass 7: found 505000 files in 2789 ms
 - Pass 8: found 505000 files in 2790 ms
 - Pass 9: found 505000 files in 2788 ms
Average: 2866 ms

Dumb Collector:
 - Pass 0: found 505000 files in 5190 ms
 - Pass 1: found 505000 files in 5204 ms
 - Pass 2: found 505000 files in 5453 ms
 - Pass 3: found 505000 files in 5311 ms
 - Pass 4: found 505000 files in 5339 ms
 - Pass 5: found 505000 files in 5362 ms
 - Pass 6: found 505000 files in 5316 ms
 - Pass 7: found 505000 files in 5319 ms
 - Pass 8: found 505000 files in 5583 ms
 - Pass 9: found 505000 files in 5197 ms
Average: 5327 ms

Threaded Collector:
 - Pass 0: found 505000 files in 2152 ms
 - Pass 1: found 505000 files in 2102 ms
 - Pass 2: found 505000 files in 2022 ms
 - Pass 3: found 505000 files in 2030 ms
 - Pass 4: found 505000 files in 2075 ms
 - Pass 5: found 505000 files in 2120 ms
 - Pass 6: found 505000 files in 2030 ms
 - Pass 7: found 505000 files in 1980 ms
 - Pass 8: found 505000 files in 1993 ms
 - Pass 9: found 505000 files in 2120 ms
Average: 2062 ms

As a side note: @Konrad Kokosa suggested blocking on each directory to make sure you don't kick off millions of threads. Don't do that...

There is no reason for you to manage how many threads are active at a given time. Let the Task framework's standard scheduler handle it; it will do a far better job of balancing the number of threads against the number of cores you have...

If you really do insist on controlling this yourself, implementing a custom scheduler would be a better option: http://msdn.microsoft.com/en-us/library/system.threading.tasks.taskscheduler(v=vs.110).aspx

Answer 1 (score: 1):

You may want to try the Windows Search API or the Indexing Service API.

Answer 2 (score: 0):

Generally, parallelism in a search-only scenario will probably just add overhead. But if Export is expensive in some way, you may get some performance benefit from multithreading. Below is the code for a multithreaded version in both C# and VB.NET (tested):

public static async Task<IEnumerable<string>> ProcessDirectoryAsync(string path, string searchPattern)
{
    var files = Directory.EnumerateFiles(path, searchPattern, SearchOption.TopDirectoryOnly);
    var subdirs = Directory.EnumerateDirectories(path, "*", SearchOption.TopDirectoryOnly);
    var results = await Task.WhenAll(files.Select(f => Task.Run(() => ExportFile(f))));
    var subresults = await Task.WhenAll(subdirs.Select(dir => Task.Run(() => ProcessDirectoryAsync(dir, searchPattern))));
    return results.Concat(subresults.SelectMany(r => r));
}

Public Shared Async Function ProcessDirectoryAsync(path As String, searchPattern As String) As Task(Of IEnumerable(Of String))
    Dim source As IEnumerable(Of String) = Directory.EnumerateFiles(path, searchPattern, SearchOption.TopDirectoryOnly)
    Dim source2 As IEnumerable(Of String) = Directory.EnumerateDirectories(path, "*", SearchOption.TopDirectoryOnly)
    Dim first As String() = Await Task.WhenAll(Of String)(
        source.Select(Function(f As String) Task.Run(Of String)(
                          Function() ExportFile(f))
                      ))
    Dim source3 As IEnumerable(Of String)() =
        Await Task.WhenAll(Of IEnumerable(Of String))(
            source2.Select(Function(dir As String) _
                               Task.Run(Of IEnumerable(Of String))(
                                   Function() ProcessDirectoryAsync(dir, searchPattern)
                               )))
    Return first.Concat(source3.SelectMany(Function(r As IEnumerable(Of String)) r))
End Function

The Await prevents this code from spawning millions of threads, one per directory/file. A quick test I did showed 5-6 worker threads doing the work. The performance gain can be around 4x.
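A minimal caller sketch for the C# version above (hypothetical; it assumes ProcessDirectoryAsync and an ExportFile returning one result string per file are in scope):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        // Kick off the recursive walk-and-export, then report the total.
        var exported = await ProcessDirectoryAsync(@"C:\Sketch", "*.ax5");
        Console.WriteLine("Exported {0} files", exported.Count());
    }

    // Stubs standing in for the answer's method and export routine.
    static Task<System.Collections.Generic.IEnumerable<string>>
        ProcessDirectoryAsync(string path, string searchPattern) =>
            Task.FromResult(Enumerable.Empty<string>());
}
```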