
Question

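I am studying a slow algorithm that splits a large text file (more than 20 GB in the real task, but you can assume 1 GB) into smaller files at the header fafafafa. The algorithm I found is given below as pseudocode.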

You can create 1 GiB of binary test data (16 blocks of 64 MiB) with the following command, as discussed here:

dd if=/dev/urandom of=sample.bin bs=64M count=16

Current

i=1;
matchCount=0;
while not end of file
    read.file
    while matchCount < i 
        match(header "fafafafa", file)
        match(2nd header "fafafafa", file)
        matchCount++; 
    end;
    store everything between two headers into a new file called rd$i.txt
    i++; 
end;

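In every iteration it reads the file again from the beginning, so the total amount of data scanned grows roughly quadratically with the number of pieces.

What else makes this algorithm slow?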

My proposal

i=1;
read.file
eventOn = 0; 
while line in linesInFile
    if not match header AND eventOn=0
        stop; # go to next line
    end;
    if match(header "fafafafa" in line)
        negate eventOn; # from 1 to 0; from 0 to 1.
        if newFile is not empty
            store.newFile as rd$i.txt
            newFile = ""; 
            i++;
            stop; # go to next line
        end;
    end; 

    if eventOn=1 
        newFile += line;
    end;
end;

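This should not read the file from the beginning again and again.
I am not sure whether the data structure used for newFile is efficient enough for appending and clearing. A stack might be a good choice, because I only need to split the file at the headers, and do it quickly.

Which data structure is suitable for fast splitting?

How should I think about algorithms for splitting a large file quickly?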

Answer 1

Scan the file once, keeping everything in a buffer until you find the delimiter, then write the buffer to a new file:

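f=1
temp = empty string
i=0
delimiter = "fafafafa"
while not eof
   b = readbyte
   temp += b               # append first, so a full match ends temp with the delimiter
   if b == delimiter[i]
      i++
   else if b == delimiter[0]
      i=1                  # the mismatching byte may start a new match
   else
      i=0
   if i == delimiter length
      truncate delimiter from end of temp
      write temp to rd$f.txt
      f++
      i=0
      temp = delimiter     # the next piece starts with its header

write temp to rd$f.txt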
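
The same single pass can be sketched in Python, reading the file in large chunks instead of byte by byte. This is only a sketch under assumptions: the function name `split_stream`, the parameters `out_dir` and `chunk_size`, and treating the header as the ASCII bytes `b"fafafafa"` are all mine, not from the question (if the header is really four 0xFA bytes, pass `bytes.fromhex("fafafafa")` instead). As in the pseudocode, the first output file holds whatever precedes the first header and may be empty.

```python
import os


def split_stream(src_path, out_dir, delimiter=b"fafafafa", chunk_size=1 << 20):
    """Split src_path at every delimiter in a single pass; return the paths written.

    Hypothetical helper: name and parameters are illustrative assumptions.
    """
    paths = []

    def open_next():
        path = os.path.join(out_dir, f"rd{len(paths) + 1}.txt")
        paths.append(path)
        return open(path, "wb")

    out = open_next()
    buf = b""   # bytes read but not yet written
    scan = 0    # index in buf where the next search starts
    with open(src_path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            buf += chunk
            while True:
                pos = buf.find(delimiter, scan)
                if pos < 0:
                    break
                out.write(buf[:pos])      # finish the current piece
                out.close()
                out = open_next()
                buf = buf[pos:]           # next piece starts with its header
                scan = len(delimiter)     # do not match that header again
            if not chunk:                 # end of file
                break
            # Write everything except a short tail that could be the
            # beginning of a delimiter spanning two chunks.
            safe = max(scan, len(buf) - (len(delimiter) - 1))
            out.write(buf[:safe])
            buf = buf[safe:]
            scan = 0
    out.write(buf)
    out.close()
    return paths
```

Memory use stays near `chunk_size` regardless of how large a piece is, because data is written out as soon as it is known not to contain the start of a header.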

Answer 2

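I am not sure I understand your question. You do not want to read the file from the beginning again and again. I would read the file once and remember the positions of the header "fafafafa" at which to split. Pointers will do the job. You then have your big file and a small set of pointers to the individual headers.

Hope this helps a bit.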
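
One possible reading of this idea, sketched in Python: a first pass records the byte offset of every header, and a second pass copies each range between consecutive offsets into its own file. The names `find_offsets` and `write_pieces`, the `out_dir` and `chunk_size` parameters, and the use of `mmap` for the search are my assumptions, not part of the answer. The sketch also assumes a non-empty file on a 64-bit system, and it drops any data that precedes the first header.

```python
import mmap
import os


def find_offsets(path, delimiter=b"fafafafa"):
    """Return the byte offset of every non-overlapping delimiter in the file."""
    offsets = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(delimiter)
        while pos != -1:
            offsets.append(pos)
            pos = mm.find(delimiter, pos + len(delimiter))
    return offsets


def write_pieces(path, offsets, out_dir, chunk_size=1 << 20):
    """Copy each range [offsets[k], offsets[k+1]) into out_dir/rd<k+1>.txt."""
    bounds = offsets + [os.path.getsize(path)]
    paths = []
    with open(path, "rb") as src:
        for n, (start, end) in enumerate(zip(bounds, bounds[1:]), 1):
            out_path = os.path.join(out_dir, f"rd{n}.txt")
            src.seek(start)
            remaining = end - start
            with open(out_path, "wb") as out:
                while remaining > 0:
                    block = src.read(min(chunk_size, remaining))
                    if not block:
                        break
                    out.write(block)
                    remaining -= len(block)
            paths.append(out_path)
    return paths
```

The offset list is tiny compared with the file, and once it exists the pieces are independent of each other, so they could be written in any order.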

Algorithm for splitting large files quickly
