Question

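I have several text files of about 2 GB each (roughly 70 million lines). I also have a quad-core machine and access to the Parallel Computing Toolbox.

Normally you would open a file and read it line by line: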

f = fopen('file.txt');
l = fgets(f);
while ischar(l)   % fgets returns -1 (not an empty value) at end of file
    % do something with l
    l = fgets(f);
end
fclose(f);

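I would like to distribute the "do something with l" step across my 4 cores, which of course requires a parfor loop. That in turn would require me to "slurp" (to borrow a Perl term) the whole 2 GB file into MATLAB first, rather than processing it on the fly. I don't actually need l itself, only the result of processing it.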

Is there a way to read the lines of a text file using parallel computing?

EDIT: It is worth mentioning that I can find the exact number of lines ahead of time (!wc -l mygiantfile.txt).

EDIT 2: The file is structured as follows:

15 1180 62444 e0e0 049c f3ec 104

Three decimal numbers, three hexadecimal numbers, and one decimal number, repeated for 70 million lines.

Answer 1

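Some of MATLAB's built-in functions are multithreaded (the original answer links to a list of them), and using them does not require the Parallel Computing Toolbox.

If "do something with l" can benefit from the toolbox, simply apply that function to each line before reading the next one.

You can also read the whole file at once with textscan: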

fid = fopen('textfile.txt');
C  = textscan(fid,'%s','delimiter','\n');
fclose(fid);

and then process the cells in parallel (the lines end up in C{1}).
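As a minimal sketch of that second step, assuming the line format shown in the question (3 decimal, 3 hex, 1 decimal field): the file name sample.txt, its two lines, and the variable names below are made up for illustration, and the sscanf call stands in for whatever "do something with l" really is.

```matlab
% Hypothetical two-line sample file in the question's format
fid = fopen('sample.txt', 'w');
fprintf(fid, '15 1180 62444 e0e0 049c f3ec 104\n');
fprintf(fid, '16 1181 62445 00ff 0010 000a 105\n');
fclose(fid);

% Read every line into a cell array of strings
fid = fopen('sample.txt');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
txt = C{1};

% Parse the lines in parallel; %x reads a hexadecimal field
n = numel(txt);
vals = zeros(n, 7);
parfor k = 1:n
    vals(k, :) = sscanf(txt{k}, '%d %d %d %x %x %x %d')';
end
```

Note that this still holds the whole file in memory as text, which is exactly the "slurp" cost the question hopes to avoid, so it mainly helps when the per-line work dominates the read time.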

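If reading time is the key concern, you may also want to access different parts of the data file from inside a parfor loop. Here is an example from Edric M Ellis.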

%Some data
x = rand(1000, 10);
fh = fopen( 'tmp.bin', 'wb' );
fwrite( fh, x, 'double' );
fclose( fh );

% Read the data
y = zeros(1000, 10);
parfor ii = 1:10
    fh = fopen( 'tmp.bin', 'rb' );
    % Get to the correct spot in the file:
    offset_bytes = (ii-1) * 1000 * 8; % 8 bytes/double
    fseek( fh, offset_bytes, 'bof' );
    % read a column
    y(:,ii) = fread( fh, 1000, 'double' );
    fclose( fh );
end

% Check
assert( isequal( x, y ) );
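The example above relies on fixed-size binary records. For a text file, line lengths vary, so each worker has to resynchronize to a line boundary after seeking. The following is only a sketch of one way to do that, not part of the original answer: the file name, the chunk count, and the "sum the first field" processing are placeholders.

```matlab
% Hypothetical sample text file in the question's format
fname = 'sample_lines.txt';
fid = fopen(fname, 'w');
for k = 1:1000
    fprintf(fid, '%d 1180 62444 e0e0 049c f3ec 104\n', k);
end
fclose(fid);

info = dir(fname);
nChunks = 4;                                  % e.g. one chunk per core
edges = round(linspace(0, info.bytes, nChunks + 1));
counts = zeros(1, nChunks);
sums = zeros(1, nChunks);

parfor c = 1:nChunks
    fh = fopen(fname, 'r');
    if edges(c) > 0
        % Step back one byte and discard up to the next newline, so a
        % line that starts exactly on the boundary is not skipped
        fseek(fh, edges(c) - 1, 'bof');
        fgetl(fh);
    end
    n = 0;
    s = 0;
    % This chunk owns every line that starts before its end offset
    while ftell(fh) < edges(c + 1)
        l = fgetl(fh);
        if ~ischar(l), break; end             % end of file
        v = sscanf(l, '%d %d %d %x %x %x %d');
        n = n + 1;
        s = s + v(1);                         % placeholder processing
    end
    fclose(fh);
    counts(c) = n;
    sums(c) = s;
end
```

Each line is handled by exactly one chunk: the chunk in whose byte range the line starts. Whether this is faster than a serial read depends on the disk, since all workers read from the same file.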

Answer 2

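As requested, here is an example of memory-mapped files using the memmapfile class.

Since you did not provide the exact format of the data file, I will create my own. The data I am creating is a table of N rows, each consisting of 4 columns: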

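the first is a double scalar value
the second is a single value
the third is a fixed-length string representing a uint32 in HEX notation (e.g. D091BB44)
the fourth is a uint8 value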

Here is the code that generates the random data and writes it to a binary file structured as described above:

% random data
N = 10;
data = [...
    num2cell(rand(N,1)), ...
    num2cell(rand(N,1,'single')), ...
    cellstr(dec2hex(randi(intmax('uint32'), [N,1]),8)), ...
    num2cell(randi([0 255], [N,1], 'uint8')) ...
];

% write to binary file
fid = fopen('file.bin', 'wb');
for i=1:N
    fwrite(fid, data{i,1}, 'double');
    fwrite(fid, data{i,2}, 'single');
    fwrite(fid, data{i,3}, 'char');
    fwrite(fid, data{i,4}, 'uint8');
end
fclose(fid);

Here is the resulting file viewed in a HEX editor:

[Figure: binary file viewed in a hex editor]

We can confirm the first record (note that my system uses little-endian byte ordering):

>> num2hex(data{1,1})
ans =
3fd4d780d56f2ca6

>> num2hex(data{1,2})
ans =
3ddd473e

>> arrayfun(@dec2hex, double(data{1,3}), 'UniformOutput',false)
ans = 
    '46'    '35'    '36'    '32'    '37'    '35'    '32'    '46'

>> dec2hex(data{1,4})
ans =
C0

Next, we open the file using memory mapping:

m = memmapfile('file.bin', 'Offset',0, 'Repeat',Inf, 'Writable',false, ...
    'Format',{
        'double', [1 1], 'd';
        'single', [1 1], 's';
        'uint8' , [1 8], 'h';      % since it doesn't directly support char
        'uint8' , [1 1], 'i'});

Now we can access the records as an ordinary structure array:

>> rec = m.Data;      % 10x1 struct array

>> rec(1)             % same as: data(1,:)
ans = 
    d: 0.3257
    s: 0.1080
    h: [70 53 54 50 55 53 50 70]
    i: 192

>> rec(4).d           % same as: data{4,1}
ans =
    0.5799

>> char(rec(10).h)    % same as: data{10,3}
ans =
2B2F493F
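Because the h field is mapped as uint8, it holds the ASCII codes of the hex digits rather than the number itself. A minimal sketch of recovering the numeric value, using the byte values shown for rec(1) above (the variable names are illustrative):

```matlab
h = uint8([70 53 54 50 55 53 50 70]);   % ASCII codes, as in rec(1).h
hexStr = char(h);                        % the 8-character hex string
val = uint32(hex2dec(hexStr));           % numeric value of that string
```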

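The benefit for large data files is that you can restrict the mapping "viewing window" to a small subset of the records and move this view along the file: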

% read the records two at-a-time
numRec = 10;                       % total number of records
lenRec = 8*1 + 4*1 + 1*8 + 1*1;    % length of each record in bytes
numRecPerView = 2;                 % how many records in a viewing window

m.Repeat = numRecPerView;
for i=1:(numRec/numRecPerView)
    % move the window along the file
    m.Offset = (i-1) * numRecPerView*lenRec;

    % read the two records in this window:
    %for j=1:numRecPerView, m.Data(j), end
    m.Data(1)
    m.Data(2)
end
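The same Offset and Repeat properties could also give each parfor worker its own window onto the file. This goes beyond what the answer shows, so treat it as a sketch under assumptions: the file records.bin, its two-field record layout, and the chunk count are all invented for the example, and N is assumed to divide evenly into chunks.

```matlab
% Hypothetical binary file of fixed-length records: one double + one uint8
fname = 'records.bin';
N = 1000;
fid = fopen(fname, 'wb');
for k = 1:N
    fwrite(fid, k, 'double');
    fwrite(fid, mod(k, 256), 'uint8');
end
fclose(fid);

lenRec = 8*1 + 1*1;           % length of each record in bytes
nChunks = 4;                  % e.g. one chunk per core
recPerChunk = N / nChunks;    % assumes N is a multiple of nChunks
partial = zeros(1, nChunks);

parfor c = 1:nChunks
    % Each worker maps only its own slice of the file
    m = memmapfile(fname, 'Writable', false, ...
        'Offset', (c-1) * recPerChunk * lenRec, ...
        'Repeat', recPerChunk, ...
        'Format', {'double', [1 1], 'd'; 'uint8', [1 1], 'i'});
    rec = m.Data;             % recPerChunk-by-1 struct array
    partial(c) = sum([rec.d]);
end
total = sum(partial);
```

This only applies once the data is in a fixed-length binary layout; the text file from the question would first need a one-time conversion.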

[Figure: accessing a portion of a file using memory mapping]

Can I read a huge text file with Parallel Computing?
