I have ~500K *.ax5 files that I have to process and export to another format. Because there are so many of them, and because Windows performs poorly with too many files in a single folder, they are tucked away in subfolders alongside other files with different extensions. What is the fastest way, in C#, to find every one of these files in subfolders at any level under C:\Sketch?
After the initial run, the folder structure is always the same, AAAA\BB\CCCC_BLD\[a bunch of different file types], and I would also like to process only files whose last-write date is later than the date of the previous run.
Alternatively, how can I quickly get a record count so that I can show the percentage processed?
I cannot change the source structure of the files/folders; it is set up by the vendor.
This is what I have. I have tried Array.ForEach and Parallel.ForEach, and both seem slow.
Sub walkTree(ByVal directory As DirectoryInfo, ByVal pattern As String)
    Array.ForEach(directory.EnumerateFiles(pattern).ToArray(),
                  Sub(fileInfo)
                      Export(fileInfo)
                  End Sub)
    For Each subDir In directory.EnumerateDirectories()
        walkTree(subDir, pattern)
    Next
End Sub
Answer 0 (score: 7)
http://msdn.microsoft.com/en-us/library/ms143316(v=vs.110).aspx
Directory.GetFiles(@"C:\Sketch", "*.ax5", SearchOption.AllDirectories);
Might that work for you?
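Not part of the original answer, but here is a rough sketch of how that call could be combined with the question's other two requirements (only files written since the last run, plus a total count for progress reporting). The export callback and the persisted lastRunTimeUtc are assumptions, not something taken from the question's code:

using System;
using System.IO;
using System.Linq;

class SketchProcessor
{
    // lastRunTimeUtc is assumed to be persisted between runs (config file, database, etc.).
    public static void ProcessChangedFiles(string root, DateTime lastRunTimeUtc, Action<string> export)
    {
        // Enumerate once, keep only files written after the last run, and materialize the
        // list so the total is known up front for percentage reporting.
        var files = Directory.EnumerateFiles(root, "*.ax5", SearchOption.AllDirectories)
                             .Where(f => File.GetLastWriteTimeUtc(f) > lastRunTimeUtc)
                             .ToList();

        for (int i = 0; i < files.Count; i++)
        {
            export(files[i]);
            Console.WriteLine($"Processed {i + 1} of {files.Count} ({100.0 * (i + 1) / files.Count:F1}%)");
        }
    }
}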
As for performance, I doubt you will find any significantly faster way to scan the directories; as @Mathew Foscarini points out, your disk is the bottleneck here.
If the drive is indexed, using the index, as @jaccus mentions, would be faster.
I took some time to benchmark this. It does look like you can get about a 33% performance gain by collecting the files asynchronously.
The test set I ran may not match your situation, and I don't know how deeply your files are nested... What I did was create 5,000 random files in each directory at each level (I settled for a single level, though), across 100 directories, for a total of 505,000 files...
I tested three ways of collecting the files...
The simplest approach:
public class SimpleFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        return new List<string>(Directory.GetFiles(directory.FullName, pattern, SearchOption.AllDirectories));
    }
}
The "dumb" approach - although it only looks dumb once you know about the overload used in the simple approach... otherwise it is a perfectly good solution:
public class DumbFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        List<string> files = new List<string>(500000);
        files.AddRange(directory.GetFiles(pattern).Select(file => file.FullName));

        foreach (DirectoryInfo dir in directory.GetDirectories())
        {
            files.AddRange(CollectFiles(dir, pattern));
        }
        return files;
    }
}
The Task API approach:
public class ThreadedFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
        InternalCollectFiles(directory, pattern, queue);
        return queue.AsEnumerable().ToList();
    }

    private void InternalCollectFiles(DirectoryInfo directory, string pattern, ConcurrentQueue<string> queue)
    {
        foreach (string result in directory.GetFiles(pattern).Select(file => file.FullName))
        {
            queue.Enqueue(result);
        }

        Task.WaitAll(directory
            .GetDirectories()
            .Select(dir => Task.Factory.StartNew(() => InternalCollectFiles(dir, pattern, queue)))
            .ToArray());
    }
}
This only tests collecting all the files, not processing them; the processing is what it would make sense to kick off on threads.
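For reference, the per-pass timings below came from a harness along these lines; this is an illustrative sketch rather than the exact code I ran:

using System;
using System.Diagnostics;
using System.IO;

class Benchmark
{
    static void Main()
    {
        var root = new DirectoryInfo(@"C:\Sketch");
        var collector = new SimpleFileCollector(); // swap in DumbFileCollector or ThreadedFileCollector

        for (int pass = 0; pass < 10; pass++)
        {
            var stopwatch = Stopwatch.StartNew();
            var files = collector.CollectFiles(root, "*.ax5");
            stopwatch.Stop();
            Console.WriteLine($"- Pass {pass}: found {files.Count} files in {stopwatch.ElapsedMilliseconds} ms");
        }
    }
}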
Here are the results on my system:
Simple Collector:
- Pass 0: found 505000 files in 2847 ms
- Pass 1: found 505000 files in 2865 ms
- Pass 2: found 505000 files in 2860 ms
- Pass 3: found 505000 files in 3061 ms
- Pass 4: found 505000 files in 3006 ms
- Pass 5: found 505000 files in 2807 ms
- Pass 6: found 505000 files in 2849 ms
- Pass 7: found 505000 files in 2789 ms
- Pass 8: found 505000 files in 2790 ms
- Pass 9: found 505000 files in 2788 ms
Average: 2866 ms
Dumb Collector:
- Pass 0: found 505000 files in 5190 ms
- Pass 1: found 505000 files in 5204 ms
- Pass 2: found 505000 files in 5453 ms
- Pass 3: found 505000 files in 5311 ms
- Pass 4: found 505000 files in 5339 ms
- Pass 5: found 505000 files in 5362 ms
- Pass 6: found 505000 files in 5316 ms
- Pass 7: found 505000 files in 5319 ms
- Pass 8: found 505000 files in 5583 ms
- Pass 9: found 505000 files in 5197 ms
Average: 5327 ms
Threaded Collector:
- Pass 0: found 505000 files in 2152 ms
- Pass 1: found 505000 files in 2102 ms
- Pass 2: found 505000 files in 2022 ms
- Pass 3: found 505000 files in 2030 ms
- Pass 4: found 505000 files in 2075 ms
- Pass 5: found 505000 files in 2120 ms
- Pass 6: found 505000 files in 2030 ms
- Pass 7: found 505000 files in 1980 ms
- Pass 8: found 505000 files in 1993 ms
- Pass 9: found 505000 files in 2120 ms
Average: 2062 ms
As a side note on @Konrad Kokosa's suggestion of blocking on each directory to make sure you don't kick off millions of threads: don't do that...
There is no reason for you to manage how many threads are active at a given time; let the Task framework's default scheduler handle it. It will do a far better job of balancing the thread count against the number of cores you have...
If you really want to control this yourself, just because, implementing a custom scheduler would be a better option: http://msdn.microsoft.com/en-us/library/system.threading.tasks.taskscheduler(v=vs.110).aspx
Answer 1 (score: 1)
You may want to try the Windows Search API or the Indexing Service API.
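Neither API is shown here in code, but as a rough sketch, the Windows Search index can be queried from C# through its OLE DB provider, assuming the C:\Sketch volume is actually covered by the index:

using System;
using System.Data.OleDb;

class IndexedSearch
{
    static void Main()
    {
        // Ask the Windows Search index for matching paths instead of walking the directory tree.
        var connectionString = "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";
        var query = "SELECT System.ItemPathDisplay FROM SystemIndex " +
                    "WHERE SCOPE='file:C:/Sketch' AND System.FileExtension='.ax5'";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(query, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}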
Answer 2 (score: 0)
Generally, parallelism in a search-only scenario is likely to add nothing but overhead. But if Export is expensive to some degree, you may gain some performance with the help of multithreading. Below is code for a multithreaded version in both C# and VB.NET (tested):
public static async Task<IEnumerable<string>> ProcessDirectoryAsync(string path, string searchPattern)
{
    var files = Directory.EnumerateFiles(path, searchPattern, SearchOption.TopDirectoryOnly);
    var subdirs = Directory.EnumerateDirectories(path, "*", SearchOption.TopDirectoryOnly);

    var results = await Task.WhenAll(files.Select(f => Task.Run(() => ExportFile(f))));
    var subresults = await Task.WhenAll(subdirs.Select(dir => Task.Run(() => ProcessDirectoryAsync(dir, searchPattern))));

    return results.Concat(subresults.SelectMany(r => r));
}
Public Shared Async Function ProcessDirectoryAsync(path As String, searchPattern As String) As Task(Of IEnumerable(Of String))
    Dim files As IEnumerable(Of String) = Directory.EnumerateFiles(path, searchPattern, SearchOption.TopDirectoryOnly)
    Dim subdirs As IEnumerable(Of String) = Directory.EnumerateDirectories(path, "*", SearchOption.TopDirectoryOnly)

    Dim results As String() = Await Task.WhenAll(Of String)(
        files.Select(Function(f As String) Task.Run(Of String)(
            Function() ExportFile(f))))

    Dim subresults As IEnumerable(Of String)() =
        Await Task.WhenAll(Of IEnumerable(Of String))(
            subdirs.Select(Function(dir As String) _
                Task.Run(Of IEnumerable(Of String))(
                    Function() ProcessDirectoryAsync(dir, searchPattern))))

    Return results.Concat(subresults.SelectMany(Function(r As IEnumerable(Of String)) r))
End Function
The awaits keep this code from spawning millions of threads, one per directory/file. A quick test I did showed 5-6 worker threads doing the work, and the performance gain was roughly 4x.
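For completeness, a hypothetical call site for the C# version, assuming ExportFile returns the path of the exported file:

// Inside an async method: kick off the recursive export and report how many files were handled.
var exported = await ProcessDirectoryAsync(@"C:\Sketch", "*.ax5");
Console.WriteLine($"Exported {exported.Count()} files.");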