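I am trying to implement a function in MATLAB that searches for files and runs in parallel to speed things up. I have already done this successfully with the following function: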
function matches = searchfordata(starting_path, search_depth, checkFunction)
    arguments
        starting_path {isfolder}
        search_depth int64
        checkFunction function_handle
    end
    tic;
    folders = struct('name', '', 'folder', starting_path);
    dataMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
    next_folders = [];
    matches = [];
    no_folders = height(folders);
    current_depth = 0;
    total_folders = 0;
    if search_depth < 0
        search_depth = 9001;
    end
    while current_depth <= search_depth && no_folders > 0
        total_folders = total_folders + no_folders;
        parfor n = 1:no_folders
            path = strcat(folders(n).folder, filesep, folders(n).name);
            [files, cfolders] = filesandfolders(path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches; check];
            next_folders = [next_folders; cfolders];
        end
        if height(matches) > 0
            dataMap(current_depth) = matches;
            matches = [];
        end
        folders = next_folders;
        no_folders = height(folders);
        next_folders = [];
        current_depth = current_depth + 1;
    end
    matches = dataMap;
    toc;
end
Other functions/classes related to this:
function [files, folders] = filesandfolders(path)
    %FILESANDFOLDERS Split the contents of a directory into files and subfolders
    %   The '.' and '..' entries are removed from the folder list.
    directory_contents = dir(path);
    files = directory_contents(~[directory_contents.isdir]);
    folders = directory_contents([directory_contents.isdir]);
    folders = folders(~ismember({folders.name}, {'.', '..'}));
end
This is basically a dir call that splits the result into files and folders and removes "." and ".." from the folder results.
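To make the behaviour concrete, here is a minimal sketch of the same filtering applied to a throwaway directory. The directory layout (one file `a.txt`, one subfolder `sub`) is made up for illustration and is not part of my real data.

```matlab
% Build a temporary directory with one file and one subfolder
% (the names 'a.txt' and 'sub' are hypothetical).
root = tempname;
mkdir(root);
mkdir(fullfile(root, 'sub'));
fclose(fopen(fullfile(root, 'a.txt'), 'w'));

% Same filtering as in filesandfolders, written inline.
contents = dir(root);
files = contents(~[contents.isdir]);
folders = contents([contents.isdir]);
folders = folders(~ismember({folders.name}, {'.', '..'}));

disp({files.name});    % names of the plain files
disp({folders.name});  % names of the subfolders, without '.' and '..'

rmdir(root, 's');      % remove the temporary directory again
```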
function boolobject = checkFiles(files)
    %CHECKFILES Checks given files for powercycler files
    %   Returns a FolderData object if any file matches, otherwise [].
    %cyclingregex = 'cycling_parameters\.xml';
    %transientregex = '\.(pol|par|raw)$';
    cyclingregex = '\.txt$';
    transientregex = '\.txt$';
    matching_cycling = regexpi({files.name}, cyclingregex, 'Match');
    matching_transient = regexpi({files.name}, transientregex, 'Match');
    cycling_indices = ~cellfun(@isempty, matching_cycling);
    transient_indices = ~cellfun(@isempty, matching_transient);
    boolobject = FolderData(files(1).folder);
    boolobject.cyclingData = any(cycling_indices);
    boolobject.rthData = any(transient_indices);
    if boolobject.cyclingData || boolobject.rthData
        return
    else
        boolobject = [];
    end
end
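This takes the list of files from filesandfolders as input and filters it for the files I am searching for. I changed the patterns to .txt for better reproducibility. The output of this function is this class: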
classdef FolderData < handle
    %FOLDERDATA Records which kinds of data files were found in a folder
    properties
        folder      %folderpath
        rthData     %bool, true if this folder contains rth-data files
        cyclingData %bool, true if this folder contains cycling-data files
    end
    methods
        function this = FolderData(path)
            this.rthData = false;
            this.cyclingData = false;
            this.folder = path;
        end
    end
end
This just records which files were found and in which folder.
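The name filter used in checkFiles can be tried in isolation. The following is a small sketch with made-up file names (they are assumptions, not files from my drive); it shows that regexpi matches case-insensitively, so `README.TXT` counts as a hit as well.

```matlab
% Hypothetical file names, for illustration only.
names = {'notes.txt', 'image.png', 'README.TXT'};

% Same pattern and call style as in checkFiles.
hits = regexpi(names, '\.txt$', 'match');

% A name matches if its cell of matches is not empty.
is_match = ~cellfun(@isempty, hits);

disp(names(is_match));  % the names that end in .txt, ignoring case
```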
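The actual search function at the top takes 8-30 seconds on my drive and works. Now I thought I could speed it up with afterEach. The basic idea is this: if the folders being processed in the parfor loop differ greatly in the amount of content, a single folder blocks the whole process, because the parfor loop has to finish before the function can continue with the next level.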
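For reference, this is the mechanism I have in mind, reduced to a minimal sketch. It assumes Parallel Computing Toolbox is installed; the callback here only prints the received value and is not my actual search callback.

```matlab
% Create a queue that workers can send data to.
q = parallel.pool.DataQueue;

% Register a callback that runs on the client for every item received.
afterEach(q, @(x) fprintf('received %d\n', x));

parfor k = 1:4
    send(q, k);  % sent from a worker; arrival order is not guaranteed
end
```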
To do this, I created the following script:
clc;
clear all;
path = 'D:\';
if isempty(gcp('nocreate'))
    parpool(4);
end
fun = @checkFiles;
output = searchparforae(path, fun);
%output = searchforae(path, fun);
%output = searchfordata(path, 100, fun);
function matches = searchparforae(starting_path, checkFunction)
    tic;
    folder_que = parallel.pool.DataQueue;
    matches = [];
    listener = afterEach(folder_que, @search_folder);
    starting_path = struct('folder', starting_path, 'name', '');
    search_folder(starting_path);
    function search_folder(input)
        parfor n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            % Print the path as data so backslashes are not read as escapes.
            fprintf(1, '%s\n', folder_path);
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches; check];
            send(folder_que, folders);
        end
    end
    toc;
end
function matches = searchforae(starting_path, checkFunction)
    tic;
    folder_que = parallel.pool.DataQueue;
    matches = [];
    listener = afterEach(folder_que, @search_folder);
    starting_path = struct('folder', starting_path, 'name', '');
    search_folder(starting_path);
    function search_folder(input)
        for n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            % Print the path as data so backslashes are not read as escapes.
            fprintf(1, '%s\n', folder_path);
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches; check];
            send(folder_que, folders);
        end
    end
    toc;
end
The two functions searchforae and searchparforae are exactly the same except for the loop. As the names suggest, searchforae uses a for loop and searchparforae uses a parfor loop.
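Now, searchforae does not work at all. The printout shows that searchforae only processes the initially given directory and the directories directly below it. Printout: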
D:\
D:\$RECYCLE.BIN
D:\Downloads
D:\OneDriveTemp
D:\Programme
D:\Repositories
D:\Sonstiges
D:\Spiele
D:\System Volume Information
D:\Uni
D:\Uni2
D:\Users
D:\Zwischenablage
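In contrast, the searchparforae function works just as well as the searchfordata function at the top, but it takes 5-10 minutes instead of 8-30 seconds. Am I using afterEach wrong? Why does it take so much longer? Also, why does searchforae not work properly, even though the only difference from searchparforae is the for loop instead of the parfor loop?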