Question

I have a large file with the following contents:

Column1 column2 column3
 345     367    Ramesh
 456     469    Ramesh
 300     301    Ramesh
 298     390    Naresh
 123     125    Suresh
 394     305    Suresh
 ......
 .....

Now I want to split this file into smaller files based on the names in column3, like this:

File1: Ramesh.txt

Column1 column2 column3
345     367      Ramesh
456     469      Ramesh
300     301      Ramesh

File2: Naresh.txt

Column1 column2 column3
298     390     Naresh

File3: Suresh.txt

Column1 column2 column3
123     125      Suresh
394     305      Suresh

And so on. I wrote the following Python code, and it works:

def split_file(file1):
    source=open(file1)
    l=[]
    header=0
    header_line=""
    file_count=0
    for line in source:
        line=line.rstrip()
        a=line.split()
        if header==0:
            header_line=line
            header+=1
        else:
            if a[-1] not in l:
                l.append(a[-1])
                file_count+=1
                if file_count>1:
                    dest.close()
                else:
                    pass
                dest=open(a[-1],'a')
                dest.write(header_line+"\n"+line+"\n")
            else:
                dest.write(line+"\n")
    source.close()
    dest.close()

Now my question is: how can I modify this code so that it works even when column3 is not sorted? For example:

Column1 column2 column3
345     367    Ramesh
123     125    Suresh
456     469    Ramesh
298     390    Naresh
300     301    Ramesh
394     305    Suresh

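Should I build a dictionary with the names in column3 as keys and generated variables (to hold the output file handles) as values, and use that dictionary to pick the right file each time the script encounters a key? Any suggestions would be appreciated.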

Answer 1

Rather than opening and closing a file pointer on every line, keep the files open until the work is done.

First, create a dictionary for the file pointers:

fps = {}

Then, inside the loop that iterates over the data file, create the file pointer if it does not exist yet:

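if a[-1] not in fps:
    fps[a[-1]] = open(a[-1], 'a')
    fps[a[-1]].write(header_line + "\n")  # new file: start it with the header
fps[a[-1]].write(line + "\n")  # line was rstripped in the loop, so restore the newline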

Then, once the loop has finished, you can close the file pointers:

for f in fps.values():
    f.close()
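
Put together, the three fragments above might look like the following. This is only a minimal sketch, not the answerer's exact code: the `out_dir` parameter, the `.txt` suffix and the `'w'` mode (instead of `'a'`, so that a rerun does not append a second header) are assumptions added for illustration.

```python
import os

def split_file(path, out_dir="."):
    """Split a whitespace-delimited file into one file per last-column value."""
    fps = {}  # one open output file per distinct name in the last column
    with open(path) as source:
        # The first line is the header ("" if the input file is empty).
        header_line = next(source, "")
        for line in source:
            a = line.split()
            if not a:  # skip blank lines
                continue
            if a[-1] not in fps:
                # First occurrence of this name: open its file and write the header.
                # Assumption: output is named "<name>.txt" inside out_dir.
                fps[a[-1]] = open(os.path.join(out_dir, a[-1] + ".txt"), "w")
                fps[a[-1]].write(header_line)
            fps[a[-1]].write(line)  # line still carries its own newline here
    # Close every output file once the whole input has been read.
    for f in fps.values():
        f.close()
```

Because the dictionary keeps every output file open, the order of the names in column3 no longer matters.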

Answer 2

def split_file(filename):
    dest = {}
    with open(filename) as source:
        header_line = next(source)
        for line in source:
            name = line.rstrip().split()[-1]
            if name not in dest:
                dest[name] = open(name + '.txt', 'w')
                dest[name].write(header_line)
            dest[name].write(line)
    for d in dest.values():
        d.close()
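
One possible refinement, which is not part of the answer above: if an exception is raised midway through the loop, the final closing loop never runs and the output files stay open. The sketch below uses `contextlib.ExitStack` from the standard library to close everything on both success and failure. The function name `split_file_safely`, the `out_dir` parameter and the returned list of names are hypothetical and only there for illustration.

```python
import os
from contextlib import ExitStack

def split_file_safely(filename, out_dir="."):
    """Like split_file, but every file is closed even if an error occurs."""
    dest = {}
    with ExitStack() as stack:
        # Files registered on the stack are closed when the with-block exits.
        source = stack.enter_context(open(filename))
        header_line = next(source, "")  # "" if the input file is empty
        for line in source:
            fields = line.split()
            if not fields:  # skip blank lines instead of raising IndexError
                continue
            name = fields[-1]
            if name not in dest:
                out_path = os.path.join(out_dir, name + ".txt")
                dest[name] = stack.enter_context(open(out_path, "w"))
                dest[name].write(header_line)
            dest[name].write(line)
    # Hypothetical convenience: report which names were seen.
    return sorted(dest)
```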

Answer 3

This is a prime use case for the groupby() function of a pandas DataFrame:

import pandas as pd

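data = pd.read_csv('dat.csv', delimiter=r"\s+")  # raw string avoids an invalid-escape warning
for val, df in data.groupby('column3'):  # a plain string key yields val as a string, not a 1-tuple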
    df.to_csv(val + ".csv", sep='\t', index=False)

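The steps are relatively simple:

1) Read the file with the correct delimiter (\s+ matches any amount of whitespace).

2) Loop over the groupby object, which yields tuples of the form (common value, dataframe for that value).

2.1) For each dataframe, write a file with the corresponding name. (index=False simply states that we don't want to write the index to the new file.)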
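
To make the (common value, rows for that value) pairs from step 2 more concrete, here is a rough standard-library sketch of the same group-then-write idea. It is not how pandas implements groupby, and it holds all rows in memory; the helper name `group_rows` and the sample data are assumptions made up for this illustration.

```python
from collections import defaultdict

def group_rows(lines):
    """Return (header_fields, groups); groups maps last-column value -> rows."""
    it = iter(lines)
    header = next(it).split()
    groups = defaultdict(list)
    for line in it:
        fields = line.split()  # splits on any amount of whitespace, like \s+
        if fields:  # ignore blank lines
            groups[fields[-1]].append(fields)
    return header, dict(groups)

sample = """Column1 column2 column3
345 367 Ramesh
123 125 Suresh
456 469 Ramesh
298 390 Naresh
""".splitlines()

header, groups = group_rows(sample)
for val, rows in groups.items():
    # Each iteration gets one name and all rows sharing it, in input order.
    text = "\n".join("\t".join(row) for row in [header] + rows)
    print(val + ".csv")
    print(text)
```

Here the printed text stands in for the file that `to_csv` would write for each group.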

Answer 4

You can create a new file handle for each value of column3 and then write all matching lines to that file, for example:

import os

def split_file(path):
    file_handles = {}  # a map of file handles based on the last param
    target_path = os.path.dirname(path)  # get the location of the passed file path
    with open(path, "r") as f:  # open our input file for reading
        header = next(f)  # reads the first line to use as a header in all files
        for line in f:
            index = line.rfind(" ")  # replace with \t if you use tab-delimited files
            value = line[index+1:].rstrip()  # get the last value
            if not value:  # invalid entry, skip
                continue
            if value not in file_handles:  # we haven't started writing to this file
                # create a new file with the value of the last column
                handle = open(os.path.join(target_path, value + ".txt"), "a")
                handle.write(header)  # write the header to our new file
                file_handles[value] = handle  # store it to our file handles list
            else:
                handle = file_handles[value]
            handle.write(line)  # write the current line to the designated handle
    for handle in file_handles.values():  # close our output file handles
        handle.close()

Then you can simply run it as:

split_file("your_file.dat")

It even respects file paths if you pass them in: the output files are created in the same directory as the input file.

Splitting a large file into smaller files in Python based on a pattern that can occur at random