Question

I have 10 CSV files with exactly the same columns and data types. What is the fastest/most efficient way to stack them?

CSV1:

col1 | col2 | col3
  1  |  'a' |  0.1
  2  |  'b' |  0.8

CSV2:

col1 | col2 | col3
  3  |  'c' |  0.4
  4  |  'd' |  0.3

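I could read them with Pandas and call df.append repeatedly, but that seems slow because I have to read everything into memory, and it may take a while if the files are very large. I am wondering whether a bash command or another Python package could do it faster.

I would rather not use anything with heavy dependencies or anything that needs compiling.

P.S. Bonus points if the solution could also automatically handle columns that exist in one dataset but not in another.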

Answer 1

A solution using head and tail:

head -n1 a.log > output.log
for f in a.log b.log; do tail -n+2 "$f"; done >> output.log

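If your input files do not end with a newline, you have to add it manually, as @zwar pointed out. Many solutions to that problem are in this thread. My favorite in this case is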

head -n1 a.log > output.log
for f in a.log b.log
do
  tail -n+2 "$f"
  [ -n "$(tail -c1 "$f")" ] && echo ""
done >> output.log
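
For readers who prefer Python, the same head/tail idea can be sketched without parsing any rows. This is only a minimal sketch, assuming every file starts with exactly one header line; the function name stack_csv_bytes and the 1 MiB block size are illustrative choices, not something from the answer above. It copies raw bytes in blocks and appends a newline when a file does not end with one.

```python
def stack_csv_bytes(paths, out_path, block_size=1 << 20):
    """Concatenate CSV files byte-wise, keeping only the first file's header.

    Assumption: every input file starts with exactly one header line.
    """
    with open(out_path, "wb") as dest:
        for i, path in enumerate(paths):
            with open(path, "rb") as src:
                header = src.readline()
                last = b""
                if i == 0:
                    dest.write(header)  # keep the header of the first file only
                    last = header
                # Copy the rest of the file in fixed-size blocks.
                for chunk in iter(lambda: src.read(block_size), b""):
                    dest.write(chunk)
                    last = chunk
                # Add the missing trailing newline, if any data was written.
                if last and not last.endswith(b"\n"):
                    dest.write(b"\n")
```

A call would look like stack_csv_bytes(["a.log", "b.log"], "output.log"). Because nothing is decoded or parsed, memory use stays at roughly one block regardless of file size.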

Answer 2

A pure Python solution:

csv_in = ["csv1.csv", "csv2.csv"]  # paths of the CSVs to concatenate
csv_out = "output.csv"

skip_header = False
with open(csv_out, "w") as dest:
    for csv in csv_in:
        with open(csv, "r") as src:
            if skip_header:  # skip the CSV header in subsequent files
                next(src, None)  # the default avoids StopIteration on an empty file
            for line in src:
                dest.write(line)
                if line[-1] != "\n":  # if not present, write a new line after each row
                    dest.write("\n")
            skip_header = True  # make sure only the first CSV header is included

To merge data with differing columns, you have to parse the CSV at least partially.
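
One way such partial parsing might look, using only the standard csv module, is sketched below. It assumes comma-delimited files with a header row and well-formed rows; the function name stack_csv_union and the empty-string fill value are hypothetical choices. The first pass reads only the header of each file to build the union of column names, and the second pass streams the rows.

```python
import csv


def stack_csv_union(paths, out_path, fill=""):
    """Stack CSV files whose columns may differ, using the union of all headers."""
    # First pass: read only the header line of each file, keeping first-seen order.
    columns = []
    for path in paths:
        with open(path, newline="") as src:
            header = next(csv.reader(src), [])
            for name in header:
                if name not in columns:
                    columns.append(name)

    # Second pass: stream the rows; restval fills columns a file does not have.
    with open(out_path, "w", newline="") as dest:
        writer = csv.DictWriter(
            dest, fieldnames=columns, restval=fill, lineterminator="\n"
        )
        writer.writeheader()
        for path in paths:
            with open(path, newline="") as src:
                for row in csv.DictReader(src):
                    writer.writerow(row)
```

Rows are written one at a time, so only the list of column names is held in memory. Parsing every row is slower than copying lines verbatim, which is the price of aligning the columns.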

Answer 3

If you want a Python solution:

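import csv

my_files = ['file_one.csv', 'file_two.csv']
final_file = []
for i, fi in enumerate(my_files):
    with open(fi, 'r', newline='') as f:
        reader = csv.reader(f, delimiter='|')
        if i > 0:
            next(reader, None)  # skip the header row of every file after the first
        for row in reader:
            final_file.append(row)

# write out the final file (all rows are held in memory until this point)
with open('final_file.csv', 'w') as out:
    for line in final_file:
        out.write('|'.join(line))
        out.write('\n')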

Answer 4

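As correctly noted in a comment to another answer, this solution will not work properly if an input CSV is missing the newline at the end of its last line.

A solution using bash and sed (assuming all files have the same columns/delimiter and every file contains a header row):

concat_csv_files: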

#!/usr/bin/env bash

head -n1 "$1"
for f do
    sed -e 1d "$f" # or: tail -n+2 "$f"
done

Example:

concat_csv_files csv* > stacked.csv

Answer 5

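Here is another pure Python solution. The idea is to use glob to build the list of files to process, then import each one into its own pandas DataFrame (adding each DataFrame to a list), and finally concatenate the list of DataFrames into one. You want to do this only once rather than with repeated df.append calls, which are too slow. I found that specifying the data type of each column helps speed things up.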

import os
import glob
import numpy as np
import pandas as pd

def process_csv_file(f):

    print("Processing file {}".format(f))

    # check if it's an empty file (have to be able to append an empty dataframe)
    # specifying the datatypes speeds up the process because pandas doesn't have to guess.
    if os.stat(f).st_size > 0:
        df = pd.read_csv(f, sep = ',', dtype = {'col1' : str, 'col2' : float}, memory_map=True)
    else:
        df = pd.DataFrame()

    return df

indir = '.'  # assumption: root directory to search (not defined in the original); adjust as needed
csv_files = glob.glob(indir + '/**/' + '*.csv', recursive=True)
print ("Found {} files to parse.".format(len(csv_files)))
frames = [process_csv_file(f) for f in csv_files]

csv_df = pd.concat(frames)
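
As a small illustration of the single concat call, the sketch below also touches on the P.S. in the question: pd.concat matches columns by name and fills the gaps with NaN. The in-memory CSV contents are made-up sample data, and ignore_index=True is an added choice to get a fresh row index.

```python
import io

import pandas as pd

# Two small in-memory CSVs with partly different columns (made-up data).
csv_a = io.StringIO("col1,col2\n1,a\n2,b\n")
csv_b = io.StringIO("col1,col3\n3,0.4\n4,0.3\n")

frames = [pd.read_csv(src) for src in (csv_a, csv_b)]

# One concat call; columns are matched by name and missing values become NaN.
stacked = pd.concat(frames, ignore_index=True, sort=False)
print(stacked)
```

This still loads every file into memory, so it suits cases where the combined data fits in RAM.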

Fastest way to stack CSV files

5 answers: