I have a large file with the following contents:
Column1 column2 column3
345 367 Ramesh
456 469 Ramesh
300 301 Ramesh
298 390 Naresh
123 125 Suresh
394 305 Suresh
......
.....
Now I want to split this file into smaller files based on the name in column3, like this:
File1:Ramesh.txt
column1 column2 column3
345 367 Ramesh
456 469 Ramesh
300 301 Ramesh
File2:Naresh.txt
column1 column2 column3
298 390 Naresh
File3:Suresh.txt
Column1 column2 column3
123 125 Suresh
394 305 Suresh
And so on. I wrote the following Python code, and it works:
def split_file(file1):
    source=open(file1)
    l=[]
    header=0
    header_line=""
    file_count=0
    for line in source:
        line=line.rstrip()
        a=line.split()
        if header==0:
            header_line=line
            header+=1
        else:
            if a[-1] not in l:
                l.append(a[-1])
                file_count+=1
                if file_count>1:
                    dest.close()
                else:
                    pass
                dest=open(a[-1],'a')
                dest.write(header_line+"\n"+line+"\n")
            else:
                dest.write(line+"\n")
    source.close()
    dest.close()
Now, my question is: how can I modify this code so that it works even when column3 is not sorted? For example:
Column1 column2 column3
345 367 Ramesh
123 125 Suresh
456 469 Ramesh
298 390 Naresh
300 301 Ramesh
394 305 Suresh
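Should I build a dictionary with the names in column3 as keys and generated variables as values (to handle the output files), and use that dictionary to open the right file each time the script encounters a key? Any suggestions would be appreciated.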
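Answer 0 (score: 1)
Instead of opening and closing a file pointer on every line, keep the file pointers open until the work is done.
First create a dictionary for the file pointers: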
fps = {}
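Then, inside the loop that iterates over the data file, create the file pointer if it does not exist yet:
if a[-1] not in fps:
    fps[a[-1]] = open(a[-1], 'a')
fps[a[-1]].write(line + "\n")  # the question's loop rstrips line, so restore the newline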
Then, once the loop has finished, you can close the file pointers:
for f in fps.values():
    f.close()
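The pieces above can be assembled into one function. The following is only a sketch under some assumptions: the name `split_by_last_column` and the `out_dir` parameter are hypothetical, the output files get a `.txt` suffix as in the question's expected result, and `contextlib.ExitStack` is swapped in for the manual closing loop so that the files are closed even if an exception interrupts the loop.

```python
import os
from contextlib import ExitStack


def split_by_last_column(path, out_dir):
    """Write each data row of `path` to <out_dir>/<last column>.txt."""
    with ExitStack() as stack:
        source = stack.enter_context(open(path))
        header = next(source, None)  # the first line is the column header
        if header is None:  # empty input: nothing to split
            return []
        handles = {}  # name in the last column -> open output file
        for line in source:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            name = fields[-1]
            if name not in handles:
                out_path = os.path.join(out_dir, name + ".txt")
                # ExitStack closes this file when the with-block exits
                handles[name] = stack.enter_context(open(out_path, "w"))
                handles[name].write(header)
            handles[name].write(line)
        return sorted(handles)
```

Because every name is looked up in the dictionary, the order of the rows in the input does not matter.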
Answer 1 (score: 1)
def split_file(filename):
    dest = {}
    with open(filename) as source:
        header_line = next(source)
        for line in source:
            name = line.rstrip().split()[-1]
            if name not in dest:
                dest[name] = open(name + '.txt', 'w')
                dest[name].write(header_line)
            dest[name].write(line)
    for d in dest.values():
        d.close()
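One caveat with keeping a handle open per name: operating systems limit how many files a process may hold open, so an input with very many distinct names could hit that limit. A possible workaround, sketched here under assumptions (the function name `split_in_batches` and the parameters `out_dir` and `batch_size` are hypothetical), is to buffer a batch of lines in memory and then open each output file only long enough to append its share.

```python
import os
from collections import defaultdict


def split_in_batches(path, out_dir, batch_size=100_000):
    """Split `path` by its last column with at most one output file open at a time."""
    seen = set()  # names whose output file already contains the header

    def flush(pending):
        for name, lines in pending.items():
            out_path = os.path.join(out_dir, name + ".txt")
            # First write truncates the file and adds the header; later writes append.
            mode = "a" if name in seen else "w"
            with open(out_path, mode) as out:
                if name not in seen:
                    out.write(header)
                    seen.add(name)
                out.writelines(lines)
        pending.clear()

    pending = defaultdict(list)  # name -> lines buffered for that name
    buffered = 0
    with open(path) as source:
        header = next(source, "")  # first line is the column header
        for line in source:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            pending[fields[-1]].append(line)
            buffered += 1
            if buffered >= batch_size:
                flush(pending)
                buffered = 0
    flush(pending)  # write whatever is left after the last full batch
    return sorted(seen)
```

The trade-off is more open and close calls in exchange for a bounded number of open handles; a larger `batch_size` uses more memory but reopens files less often.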
Answer 2 (score: 1)
This is a prime example for the groupby() function of a pandas DataFrame:
import pandas as pd
data = pd.read_csv('dat.csv', delimiter=r"\s+")  # raw string, since \s+ is a regex
# group by a plain column name so that val is a plain string
for val, df in data.groupby('column3'):
    df.to_csv(val + ".csv", sep='\t', index=False)
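The steps are relatively simple:
1) Read the file with the correct delimiter (\s+ matches any run of whitespace).
2) Loop over the groups; each one is a (common value, dataframe for that value) pair.
2.1) Write one file per dataframe, named after its value.
(index=False just states that we do not want the index written to the new file.)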
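To see what the loop over the groups produces, here is a small self-contained sketch. It assumes pandas is installed; the helper name `split_with_pandas` and the `out_dir` parameter are hypothetical, and the unsorted sample from the question is read from memory instead of from a file. The column name is passed to groupby() as a plain string, so each group key is a plain string too.

```python
import io
import os
import tempfile

import pandas as pd

# Unsorted sample from the question, held in memory for the demo.
SAMPLE = """Column1 column2 column3
345 367 Ramesh
123 125 Suresh
456 469 Ramesh
298 390 Naresh
300 301 Ramesh
394 305 Suresh
"""


def split_with_pandas(text, out_dir):
    """Write one tab-separated file per value of column3; return row counts."""
    data = pd.read_csv(io.StringIO(text), sep=r"\s+")
    written = {}
    # Each iteration yields (value of column3, sub-frame holding its rows).
    for name, frame in data.groupby("column3"):
        out_path = os.path.join(out_dir, name + ".csv")
        frame.to_csv(out_path, sep="\t", index=False)
        written[name] = len(frame)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(split_with_pandas(SAMPLE, tmp))
```

Note that this approach loads the whole input into memory, which may matter for a very large file.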
Answer 3 (score: 0)
You can create a new file handle for each value of column3 and then write every matching line to that file, for example:
import os

def split_file(path):
    file_handles = {}  # a map of file handles based on the last param
    target_path = os.path.dirname(path)  # get the location of the passed file path
    with open(path, "r") as f:  # open our input file for reading
        header = next(f)  # reads the first line to use as a header in all files
        for line in f:
            index = line.rfind(" ")  # replace with \t if you use tab-delimited files
            value = line[index+1:].rstrip()  # get the last value
            if not value:  # invalid entry, skip
                continue
            if value not in file_handles:  # we haven't started writing to this file
                # create a new file with the value of the last column
                handle = open(os.path.join(target_path, value + ".txt"), "a")
                handle.write(header)  # write the header to our new file
                file_handles[value] = handle  # store it to our file handles list
            else:
                handle = file_handles[value]
            handle.write(line)  # write the current line to the designated handle
    for handle in file_handles.values():  # close our output file handles
        handle.close()
Then you can run it simply as:
split_file("your_file.dat")
If you pass a path that includes a directory, the output files are written to that same directory.