大家好我正在学习MPI编程,我遇到了这个问题。假设我有一个包含100,000行/行的.txt文件,如何将它们分块以便由4个处理器处理?即我想让处理器0负责0-25000行的处理,处理器1处理25001-50000等处理。我做了一些搜索,确实遇到了MPI_File_seek,但我不确定它是否可以在.txt上运行并且之后支持fscanf。
答案 0 :(得分:21)
文本对于并行处理来说并不是一个很好的格式,因为你事先并不知道第25001行开始的地方。因此,这些问题通常会通过一些预处理步骤提前处理,要么建立索引,要么将文件分区为适当数量的块以供每个进程读取。
如果您真的想通过MPI进行,我建议使用MPI-IO将文本文件的重叠块读入各种处理器,其中重叠比您预期的最长线要长得多,然后让每个处理器同意从哪里开始;例如,你可以说过程N和N + 1共享的重叠区域中的第一个(或最后一个)新行是过程N离开并且N + 1开始的地方。
要使用一些代码进行跟进,
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
void parprocess(MPI_File *in, MPI_File *out, const int rank, const int size, const int overlap) {
MPI_Offset globalstart;
int mysize;
char *chunk;
/* read in relevant chunk of file into "chunk",
* which starts at location in the file globalstart
* and has size mysize
*/
{
MPI_Offset globalend;
MPI_Offset filesize;
/* figure out who reads what */
MPI_File_get_size(*in, &filesize);
filesize--; /* get rid of text file eof */
mysize = filesize/size;
globalstart = rank * mysize;
globalend = globalstart + mysize - 1;
if (rank == size-1) globalend = filesize-1;
/* add overlap to the end of everyone's chunk except last proc... */
if (rank != size-1)
globalend += overlap;
mysize = globalend - globalstart + 1;
/* allocate memory */
chunk = malloc( (mysize + 1)*sizeof(char));
/* everyone reads in their part */
MPI_File_read_at_all(*in, globalstart, chunk, mysize, MPI_CHAR, MPI_STATUS_IGNORE);
chunk[mysize] = '\0';
}
/*
* everyone calculate what their start and end *really* are by going
* from the first newline after start to the first newline after the
* overlap region starts (eg, after end - overlap + 1)
*/
int locstart=0, locend=mysize-1;
if (rank != 0) {
while(chunk[locstart] != '\n') locstart++;
locstart++;
}
if (rank != size-1) {
locend-=overlap;
while(chunk[locend] != '\n') locend++;
}
mysize = locend-locstart+1;
/* "Process" our chunk by replacing non-space characters with '1' for
* rank 1, '2' for rank 2, etc...
*/
for (int i=locstart; i<=locend; i++) {
char c = chunk[i];
chunk[i] = ( isspace(c) ? c : '1' + (char)rank );
}
/* output the processed file */
MPI_File_write_at_all(*out, (MPI_Offset)(globalstart+(MPI_Offset)locstart), &(chunk[locstart]), mysize, MPI_CHAR, MPI_STATUS_IGNORE);
return;
}
int main(int argc, char **argv) {
MPI_File in, out;
int rank, size;
int ierr;
const int overlap = 100;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (argc != 3) {
if (rank == 0) fprintf(stderr, "Usage: %s infilename outfilename\n", argv[0]);
MPI_Finalize();
exit(1);
}
ierr = MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
if (ierr) {
if (rank == 0) fprintf(stderr, "%s: Couldn't open file %s\n", argv[0], argv[1]);
MPI_Finalize();
exit(2);
}
ierr = MPI_File_open(MPI_COMM_WORLD, argv[2], MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &out);
if (ierr) {
if (rank == 0) fprintf(stderr, "%s: Couldn't open output file %s\n", argv[0], argv[2]);
MPI_Finalize();
exit(3);
}
parprocess(&in, &out, rank, size, overlap);
MPI_File_close(&in);
MPI_File_close(&out);
MPI_Finalize();
return 0;
}
在问题文本的窄版本上运行,我们得到
$ mpirun -n 3 ./textio foo.in foo.out
$ paste foo.in foo.out
Hi guys I am learning to 11 1111 1 11 11111111 11
program in MPI and I came 1111111 11 111 111 1 1111
across this question. Lets 111111 1111 111111111 1111
say I have a .txt file with 111 1 1111 1 1111 1111 1111
100,000 rows/lines, how do 1111111 11111111111 111 11
I chunk them for processing 1 11111 1111 111 1111111111
by 4 processors? i.e. I want 22 2 22222222222 2222 2 2222
to let processor 0 take care 22 222 222222222 2 2222 2222
of the processing for lines 22 222 2222222222 222 22222
0-25000, processor 1 to take 22222222 222222222 2 22 2222
care of 25001-50000 and so 2222 22 22222222222 222 22
on. I did some searching and 333 3 333 3333 333333333 333
did came across MPI_File_seek 333 3333 333333 3333333333333
but I am not sure can it work 333 3 33 333 3333 333 33 3333
on .txt and supports fscanf 33 3333 333 33333333 333333
afterwards. 33333333333