Question

How can I split a large CSV file (~100 GB) and keep the header in each part?

For example, split

h1 h2
a  aa
b  bb

into

h1 h2
a  aa

and

h1 h2
b  bb

Answer 1

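First, separate the header from the data. Keep only the header in a shell variable: at ~100 GB the data rows will not fit in one, and an unquoted echo $data would also collapse their newlines into spaces.

header=$(head -n 1 "$file")

Then stream the data rows (everything from line 2 onward) straight into split:

tail -n +2 "$file" | split [options...] -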

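In the options you must specify the chunk size and the naming pattern (prefix) for the resulting files. Do not remove the trailing -, because it tells split to read its data from stdin.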
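As a concrete illustration of those options, here is a minimal sketch. The function name split_rows, the chunk size and the part_ prefix are assumptions chosen for the example, not part of the original answer. It streams the rows with tail instead of holding them in a variable.

```shell
#!/bin/sh
# Sketch: stream the data rows of a CSV into split.
# split_rows and its arguments are illustrative names.
split_rows() {
  file=$1     # input CSV
  lines=$2    # data rows per piece
  prefix=$3   # output name prefix: part_ gives part_aa, part_ab, ...
  # tail -n +2 skips the header line; "-" makes split read stdin
  tail -n +2 "$file" | split -l "$lines" - "$prefix"
}
```

Calling split_rows big.csv 1000 part_ would leave header-less pieces named part_aa, part_ab and so on in the current directory.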

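You can then insert the header at the top of each file (this one-line form of the i command, and -i without a suffix, are GNU sed syntax):

sed -i "1i$header" "$splitOutputFile"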

Obviously you should run this last step in a for loop, but the exact code depends on the prefix chosen for the split operation.
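The loop could look roughly like the sketch below, assuming the pieces were written with a hypothetical part_ prefix. It swaps sed -i for cat plus mv, so it does not rely on GNU sed and is not affected by backslashes in the header. The function name add_header is illustrative.

```shell
#!/bin/sh
# Sketch: prepend a header line to every piece produced by split.
# add_header and its arguments are illustrative names.
add_header() {
  header=$1   # header line to insert
  prefix=$2   # prefix that was passed to split
  for piece in "$prefix"*; do
    [ -f "$piece" ] || continue   # the glob matched nothing
    # Write header + piece to a temp file, then replace the piece
    { printf '%s\n' "$header"; cat "$piece"; } > "$piece.tmp" &&
      mv "$piece.tmp" "$piece"
  done
}
```

For example, add_header "$header" part_ would rewrite part_aa, part_ab and so on in place. Each piece is copied once, so this needs free disk space for one extra piece at a time.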

Answer 2

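None of the previous solutions worked properly on the Mac systems my script targets (why, Apple? why?). I ended up with a printf-based approach that works well as a proof of concept. I plan to improve performance by putting the temporary files on a ramdisk or similar, since this version writes a lot to disk and will probably be slow.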

#!/bin/sh

# Pass a file in as the first argument on the command line (note, not secure)
file=$1

# Get the header line out
header=$(head -n 1 "$file")

# Separate the data from the header
tail -n +2 "$file" > output.data

# Split the data into 1000 lines per file (change as you wish)
split -l 1000 output.data output_part_

# Prepend the header to each file from split
# (the output_part_ prefix keeps the glob from matching output.data itself)
for part in output_part_*
do
  # $(cat ...) strips trailing newlines, so printf adds the final one back
  printf '%s\n%s\n' "$header" "$(cat "$part")" > "$part"
done
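If the extra disk traffic is the concern, one possible alternative (not the author's method) is a single pass with awk that writes the header into each piece as the piece is created, so that neither output.data nor a rewrite of the pieces is needed. This is only a sketch: the function name split_with_header and the numbered .csv naming scheme are assumptions made for the example. Like split -l, it works on lines, so it assumes no quoted field contains a newline.

```shell
#!/bin/sh
# Sketch: one-pass split that puts the header at the top of every piece.
# split_with_header and the output naming scheme are illustrative.
split_with_header() {
  file=$1     # input CSV
  lines=$2    # data rows per piece
  prefix=$3   # output name prefix: out_ gives out_00001.csv, ...
  awk -v n="$lines" -v prefix="$prefix" '
    NR == 1 { header = $0; next }        # remember the header line
    (NR - 2) % n == 0 {                  # first data row of a new piece
      if (out != "") close(out)          # keep only one file open
      out = sprintf("%s%05d.csv", prefix, ++count)
      print header > out
    }
    { print > out }                      # append the data row
  ' "$file"
}
```

Calling split_with_header big.csv 1000 out_ would produce out_00001.csv, out_00002.csv and so on, each starting with the header line.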

Answer 3

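You can download the free CsvSplitter from here. The site offers a zip file containing a simple portable .exe and a .txt file that must be kept alongside the executable; just extract the contents into a directory and it is ready to use:

It can split files as shown in this image

Everything is self-explanatory, but more details can be found here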
