Question

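I have a data import script that reads lines and adds them to a database, and so far it has worked well. Unfortunately, something in the script (or its runtime, or the database library, or whatever) leaks memory, so large imports use monotonically increasing main memory, which leads to slow swapping and then to the process dying from memory exhaustion. Splitting the import into multiple runs is a workaround; I have been doing that with split and then looping the import script over each piece.

But I would rather skip creating the split files, and this feels like it should be a one-liner. In fact, it seems there should be an equivalent of xargs that passes lines to the specified command on stdin rather than as arguments. If this hypothetical command were xlines, I would expect the following to run the myimport script once for each batch of up to 50,000 lines in giantfile.txt: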

cat giantfile.txt | xlines -L 50000 myimport

Am I missing an xlines-like capability under some other name, or hidden in another command's options? Or can xlines be done in a few lines of Bash script?

Answer 1

Use GNU Parallel.

You will need the --pipe option together with the --block option (which takes a size in bytes, not a number of lines).

Something like:

cat giantfile.txt | parallel -j 8 --pipe --block 4000000 myimport

(That picks a block size of 50,000 lines * 80 bytes per line = 4000000 bytes, which can also be abbreviated as 4m here.)
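The 80 bytes per line figure above is a guess. As a rough sketch, the block size can instead be estimated from the file itself by multiplying its average line length by the desired number of lines per batch. The helper name estimate_block_size is hypothetical and not part of GNU Parallel; it only uses wc and shell arithmetic.

```shell
#!/bin/bash
# estimate_block_size FILE LINES_PER_BATCH
# Hypothetical helper (not part of GNU Parallel): prints an approximate
# byte count covering LINES_PER_BATCH lines of FILE, based on the
# file's average line length rounded up to a whole byte.
estimate_block_size() {
    local file=$1 lines_per_batch=$2
    local total_bytes total_lines
    total_bytes=$(wc -c < "$file")
    total_lines=$(wc -l < "$file")
    # An empty file has no average line length; report 0 instead of dividing by zero.
    (( total_lines > 0 )) || { echo 0; return; }
    echo $(( (total_bytes + total_lines - 1) / total_lines * lines_per_batch ))
}

# Possible usage, assuming GNU Parallel and myimport are installed:
#   parallel -j 1 --pipe --block "$(estimate_block_size giantfile.txt 50000)" myimport < giantfile.txt
```

Because --block counts bytes, each batch will still only contain roughly the requested number of lines, more when lines are shorter than average and fewer when they are longer.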

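If you do not want the jobs to actually run in parallel, change the 8 to 1. Or you can leave it out entirely, and it will run one job per CPU core.

You can also avoid the cat by running: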

parallel ... < giantfile.txt

Answer 2

Save the following code as a script named test.sh.

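#!/bin/bash
# Usage: ./test.sh LINES_PER_BATCH INPUT_FILE
tempFile=/tmp/yourtempfile.temp
rm -f "$tempFile"
declare -i cnt=0
while IFS= read -r line || [[ -n $line ]]
do
    printf '%s\n' "$line" >> "$tempFile"
    cnt=$((cnt + 1))
    if (( cnt >= $1 )); then
        # The batch is full: feed it to the import script and start a new one.
        myimport < "$tempFile"
        rm -f "$tempFile"
        cnt=0
    fi
done < "$2"

# Import the final, partial batch if any lines are left over.
if (( cnt > 0 )); then
    myimport < "$tempFile"
    rm -f "$tempFile"
fi

exit 0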

Then run ./test.sh 500000 giantfile.txt. I use tempFile to hold the specified number of lines, then hand it to your import script for processing. I hope it helps.

Answer 3

My approach, without installing parallel and without writing temporary files:

#!/bin/bash

[ ! -f "$1" ] && echo "missing file." && exit 1

command="$(which cat)" # just as example, insert your command here
totalSize="$(wc -l < "$1")" # number of lines in the input file
chunkSize=3 # just for the demo, set to 50000 in your version
offset=1 # 1-based line number where the next chunk starts

while (( offset <= totalSize )); do

        # Skip to line $offset, take the next $chunkSize lines, pipe them to the command.
        tail -n +"$offset" "$1" | head -n "$chunkSize" | $command
        offset=$((offset + chunkSize))
        echo "----"
done

Test:

seq 1000 1010 > testfile.txt
./splitter.sh testfile.txt

Output:

1000
1001
1002
----
1003
1004
1005
----
1006
1007
1008
----
1009
1010
----

This way the solution stays portable, and it performs better than using temporary files.
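For comparison, here is a minimal sketch of the xlines command the question imagines, written as a Bash function that reads stdin in a single pass. Each tail -n +offset call in the script above has to scan the file from its beginning again, and it needs a real file rather than a pipe; this variant avoids both, at the cost of holding one batch in memory and of Bash's relatively slow line-by-line read. The function name xlines and its argument order are assumptions taken from the question, not an existing tool.

```shell
#!/bin/bash
# xlines N CMD [ARGS...]
# Hypothetical helper: runs CMD once per batch of up to N lines read
# from stdin, passing each batch to CMD on its stdin.
xlines() {
    local n=$1; shift
    local line batch
    while :; do
        batch=()
        # Collect up to n lines; also accept a final line with no trailing newline.
        while (( ${#batch[@]} < n )) && { IFS= read -r line || [[ -n $line ]]; }; do
            batch+=("$line")
        done
        # No lines collected means the input is exhausted.
        (( ${#batch[@]} > 0 )) || break
        # CMD reads from the pipe, so it cannot consume the remaining input.
        printf '%s\n' "${batch[@]}" | "$@"
        # A short batch means end of input was reached.
        (( ${#batch[@]} == n )) || break
    done
}

# Possible usage, assuming myimport is on the PATH:
#   xlines 50000 myimport < giantfile.txt
```

Each batch is sent through a pipe, so nothing is written to disk, and every invocation of the command is a fresh process, which is what works around the memory leak.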

Suggested way to batch stdin lines to another repeated command, like xargs but via stdin rather than arguments?
