Question

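I'm new to programming in general and to Python in particular. I have a large CSV file that I need to split into multiple CSV files based on the value in the target column (the last column).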

Here is a simplified version of the CSV data I want to split.

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1
8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0
4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1
7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

I want to split it so that the data ends up in different CSV files, as shown below:

sample1.csv

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1

sample2.csv

8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0

sample3.csv

4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1

sample4.csv

7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

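I tried using pandas and some groupby functions, but that merges all the 1s and all the 0s into two separate files, one containing every row with a 1 and the other every row with a 0, which is not the output I need.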

Any help would be greatly appreciated.

Answer 1

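What you can do is read the value in the last column of each row. If it is the same as the value in the previous row, add the row to the same list; if not, create a new list and add the row to that empty list. For the data structure, use a list of lists.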
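The idea can be sketched roughly as follows. This is only an illustration: the function name `split_into_runs` and the shortened sample rows are made up here, and each row is assumed to be a list of strings whose last item is the target.

```python
def split_into_runs(rows):
    """Group consecutive rows that share the same last-column value."""
    runs = []            # the list of lists
    previous = object()  # sentinel that never equals a real target value
    for row in rows:
        target = row[-1]
        if target != previous:
            runs.append([])  # the target changed: start a new list
            previous = target
        runs[-1].append(row)
    return runs


# Hypothetical, shortened rows: one value column plus the target column.
rows = [
    ["1254.00", "1"],
    ["1235.45", "1"],
    ["8623.75", "0"],
    ["4965.25", "1"],
]
runs = split_into_runs(rows)
print(len(runs))  # 3
```

Each inner list can then be written to its own file.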

Answer 2

Assume the file "input.csv" contains the original data.

1254.00   1364.00   4562.33   4595.32   1
1235.45   1765.22   4563.45   4862.54   1
6235.23   4563.00   7832.31   5320.36   1
8623.75   5632.09   4586.25   9361.86   0
5659.92   5278.21   8632.02   4567.92   0
4965.25   1983.78   4326.50   7901.10   1
7453.12   4993.20   4573.30   8632.08   1
8963.51   7496.56   4219.36   7456.46   1
9632.23   7591.63   8612.37   4591.00   1
7632.08   4563.85   4632.09   6321.27   0
4693.12   7621.93   5201.37   7693.48   0
6351.96   7216.35   795.52    4109.05   0

The code below writes each run of rows with the same target to its own file:

target = None
counter = 0
tmp = []
with open('input.csv', 'r') as file_in:
    for line in file_in:
        if not line.strip():
            continue  # skip blank lines
        _target = line.split()[-1]  # the target is the last column
        # the target changed: write out the finished run and start a new one
        if tmp and _target != target:
            counter += 1
            with open('sample{}.csv'.format(counter), 'w') as file_out:
                file_out.writelines(tmp)
            tmp = []
        tmp.append(line)
        target = _target

# write the final run, which the loop itself never flushes
if tmp:
    counter += 1
    with open('sample{}.csv'.format(counter), 'w') as file_out:
        file_out.writelines(tmp)

Answer 3

Maybe you want something like this:

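from itertools import groupby
from operator import itemgetter

sep = '   '

with open('data.csv') as f:
    data = f.read()

# splitlines() plus the filter drops the empty row a trailing newline would create
split_data = [row.split(sep) for row in data.splitlines() if row.strip()]
# the target is the last column; groupby only merges consecutive equal keys
gb = groupby(split_data, key=itemgetter(-1))

# start at 1 so the files are named sample1.csv, sample2.csv, ...
for index, (key, group) in enumerate(gb, start=1):
    with open('sample{}.csv'.format(index), 'w') as f:
        write_data = '\n'.join(sep.join(row) for row in group)
        f.write(write_data)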

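Unlike pandas' groupby, itertools.groupby does not sort the source beforehand, so only consecutive rows with the same key end up in the same group. The code parses the input CSV into a list of lists and runs groupby over the outer list, keyed on the fifth (last) column, which holds the target. The groupby object is an iterator over the groups; writing each group to a different file gives the desired result.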
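The difference described above can be seen on a small list of target values. This is a minimal sketch; the `labels` list is made up for illustration and stands in for the last column of the data.

```python
from itertools import groupby

# Hypothetical target column: two 1s, two 0s, then one more 1.
labels = ["1", "1", "0", "0", "1"]

# Unsorted input: every consecutive run becomes its own group.
runs = [(key, len(list(group))) for key, group in groupby(labels)]
print(runs)    # [('1', 2), ('0', 2), ('1', 1)]

# Sorted input: equal keys become adjacent, so all 1s collapse into one group.
merged = [(key, len(list(group))) for key, group in groupby(sorted(labels))]
print(merged)  # [('0', 2), ('1', 3)]
```

The first result is the per-run split the question asks for; the second is the merged grouping the asker got and did not want.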

Answer 4

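I suggest using a function to do what is asked.

It would be possible to leave the file objects we have opened for writing unreferenced, so that they are closed automatically when they are garbage collected, but here I prefer to close each output file explicitly before opening another one.

The script is heavily commented, so no further explanation is needed: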

def split_data(data_fname, key_len=1, basename='file%03d.txt'):

    current_output = None # because we have not yet opened an output file
    prev_key = None       # because a string is always different from None
    count = 0             # because we want to count the output files

    with open(data_fname) as data:   # the input file is closed automatically

        for line in data:

            # the key is the last key_len characters of the line, once the
            # trailing newline (absent on an unterminated last line) is removed
            key = line.rstrip('\n')[-key_len:]

            if key != prev_key:      # key has changed!

                count += 1           # a new file is going to be opened
                prev_key = key       # remember the new key
                if current_output:   # if a file was opened, close it
                    current_output.close()
                # open a new output file, its name derived from the variable count
                current_output = open(basename % count, 'w')

            # now we can write to the output file
            current_output.write(line)
            # note that line normally ends with a newline already

    # clean up what is still open (nothing was opened if the input was empty)
    if current_output:
        current_output.close()


Split a CSV file into multiple CSVs by target column value

4 answers: