What is the fastest way to write blocks of data from a big file to new files?

Asked: 2020-01-29 08:58:50

Tags: python-3.x awk bigdata

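Suppose I have a file that is just a repetition of very similar blocks (a simplified example is shown below). What is the fastest way to extract certain blocks and write them to separate files? Every block starts with the same number followed by \n. The input file can contain more than a million steps, and each block can contain thousands of atoms. Because I only need a limited number of steps (for example, every 1000th step), I do not want to read the whole (huge) file or loop over all of it. I am considering bash scripting (sed, or head with grouping), Python (memory mapping and using a regex to store the blocks), or awk (Write blocks in a text file to multiple new files). Are there any methods or languages I am not aware of? Thanks.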

6
step 1
C                  9.0000000    8.3380808    9.0000001
C                  9.0000000    9.6619194    8.9999999
H                  8.0768455    7.7678700    9.0000001
H                  9.9231545   10.2321301    9.0000001
H                  8.0768455   10.2321301    9.0000001
H                  9.9231545    7.7678700    9.0000001
6
step 2
 C                  9.00000000    8.33808080    9.00000010
 C                  9.00000000    9.66191940    8.99999990
 H                  8.07684550    7.76787000    9.00000010
 H                  9.90912982   10.23213008    8.83969637
 H                  8.09087028   10.23213012    9.16030383
 H                  9.92315450    7.76787000    9.00000010
6
step 3
 C                  9.00000000    8.33808080    9.00000010
 C                  9.00000000    9.66191940    8.99999990
 H                  8.07684550    7.76787000    9.00000010
 H                  9.86748170   10.23213006    8.68426301
 H                  8.13251850   10.23213014    9.31573717
 H                  9.92315450    7.76787000    9.00000010

1 Answer:

Answer 0 (score: 0)

I wrote a small proof of concept in awk. Is this close to what you want?

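awk '
  # atom-count line (starts with a digit): report it and do not copy it
  /^[0-9]/ { print "skipping " $0; next; }
  # "step N" header: close the previous output file, then name the next one step.N
  /step /  { if (fn != "") close(fn); fn = sprintf("%s.%s", $1, $2); print "assigned fn = ", fn; }
  # atom line (optional leading spaces, then an element symbol): append to the current file
  /^ *[A-Z]/ { print $0 >> fn; print "sent ", $0, " to ", fn; }
' infile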

Output:

skipping 6
assigned fn =  step.1
sent  C                  9.0000000    8.3380808    9.0000001  to  step.1
sent  C                  9.0000000    9.6619194    8.9999999  to  step.1
sent  H                  8.0768455    7.7678700    9.0000001  to  step.1
sent  H                  9.9231545   10.2321301    9.0000001  to  step.1
sent  H                  8.0768455   10.2321301    9.0000001  to  step.1
sent  H                  9.9231545    7.7678700    9.0000001  to  step.1
skipping 6
assigned fn =  step.2
sent   C                  9.00000000    8.33808080    9.00000010  to  step.2
sent   C                  9.00000000    9.66191940    8.99999990  to  step.2
sent   H                  8.07684550    7.76787000    9.00000010  to  step.2
sent   H                  9.90912982   10.23213008    8.83969637  to  step.2
sent   H                  8.09087028   10.23213012    9.16030383  to  step.2
sent   H                  9.92315450    7.76787000    9.00000010  to  step.2
skipping 6
assigned fn =  step.3
sent   C                  9.00000000    8.33808080    9.00000010  to  step.3
sent   C                  9.00000000    9.66191940    8.99999990  to  step.3
sent   H                  8.07684550    7.76787000    9.00000010  to  step.3
sent   H                  9.86748170   10.23213006    8.68426301  to  step.3
sent   H                  8.13251850   10.23213014    9.31573717  to  step.3
sent   H                  9.92315450    7.76787000    9.00000010  to  step.3

Resulting files:

$: cat step.1
C                  9.0000000    8.3380808    9.0000001
C                  9.0000000    9.6619194    8.9999999
H                  8.0768455    7.7678700    9.0000001
H                  9.9231545   10.2321301    9.0000001
H                  8.0768455   10.2321301    9.0000001
H                  9.9231545    7.7678700    9.0000001
$: cat step.2
 C                  9.00000000    8.33808080    9.00000010
 C                  9.00000000    9.66191940    8.99999990
 H                  8.07684550    7.76787000    9.00000010
 H                  9.90912982   10.23213008    8.83969637
 H                  8.09087028   10.23213012    9.16030383
 H                  9.92315450    7.76787000    9.00000010
$: cat step.3
 C                  9.00000000    8.33808080    9.00000010
 C                  9.00000000    9.66191940    8.99999990
 H                  8.07684550    7.76787000    9.00000010
 H                  9.86748170   10.23213006    8.68426301
 H                  8.13251850   10.23213014    9.31573717
 H                  9.92315450    7.76787000    9.00000010

Note that the first block in your example has no leading whitespace, while the later blocks have one leading space.
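
If the blocks are later parsed in Python (one of the options mentioned in the question), the leading space need not matter. The following is only a small sketch; `parse_atom_line` is a hypothetical helper name, and it assumes every atom line holds an element symbol followed by exactly three coordinates, as in the sample.

```python
def parse_atom_line(line):
    """Split an atom line into (element, x, y, z), ignoring leading whitespace."""
    # str.split() with no arguments drops leading/trailing whitespace and
    # collapses runs of spaces, so both layouts in the sample parse the same way.
    element, x, y, z = line.split()
    return element, float(x), float(y), float(z)


# One line from the first block (no leading space) and one from a later block.
first = parse_atom_line("H                  8.0768455    7.7678700    9.0000001")
later = parse_atom_line(" H                  8.07684550    7.76787000    9.00000010")
print(first == later)  # True: same element and same numeric values
```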

Adjust as needed; I hope this helps.
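
For the "every 1000th step" part of the question, here is a rough Python sketch of the same idea. It is not a benchmarked or definitive solution. `extract_steps`, `every`, and `prefix` are names made up for this example. It assumes each block is an atom-count line, a `step N` line, and then exactly that many atom lines. Like the awk version, it writes only the atom lines. Line widths differ between blocks in the sample, so byte offsets cannot be computed in advance; the file is therefore still read once from start to end, but unwanted blocks are skipped without being parsed or stored.

```python
from itertools import islice


def extract_steps(path, every=1000, prefix="step"):
    """Write every `every`-th block of `path` to its own file; return the file names."""
    written = []
    with open(path, "r") as src:
        while True:
            count_line = src.readline()
            if not count_line.strip():
                break  # end of file, or a blank line after the last block
            natoms = int(count_line)         # first line of a block: atom count
            header = src.readline()          # second line: "step N"
            step = int(header.split()[1])
            atoms = islice(src, natoms)      # lazy view of the next natoms lines
            if step % every == 0:
                name = f"{prefix}.{step}"
                with open(name, "w") as dst:
                    dst.writelines(atoms)    # copy the atom lines unchanged
                written.append(name)
            else:
                for _ in atoms:              # consume the lines without keeping them
                    pass
    return written
```

Calling `extract_steps("infile", every=1000)` would create `step.1000`, `step.2000`, and so on in the current directory. Each output file is closed as soon as its block is written, so the number of open files stays constant however many steps are extracted.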