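For the past 3 years we have received a file from a counterparty every day, so we now have more than 1,000 files. Depending on the day, each file has between 5,000 and 15,000 rows.
I tried to combine them in Python, working in Visual Studio Code, after some Googling and research. For testing I only took the file from the last day of each month, 33 files in total.
The files look like this: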
File 1:
Header_1  Header_2  Header_3
0         2         1
2         3         2
4         3

File 2:
Header_1  Header_4  Header_3  Header_2
6         4         3         1
8         5         4
10

Desired output:
Header_1  Header_2  Header_3  Header_4  File_Name
0         2         1                   File 1
2         3         2                   File 1
4         3                             File 1
6         1         3         4         File 2
8                   4         5         File 2
10                                      File 2
The code I used to try this is:
import os
import pandas as pd
import glob
# set working directory
os.chdir("/filepath/")
globbed_files = glob.glob("*.csv")  # creates a list of all csv files
print(globbed_files)
data = []  # pd.concat takes a list of dataframes as an argument
for csv in globbed_files:
    frame = pd.read_csv(csv)
    data.append(frame)
    print(frame)  # to check while running whether the frame was correct
bigframe = pd.concat(data, ignore_index=True, keys=globbed_files)
bigframe.to_csv("output.csv")
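If necessary I can drop the file name, and empty cells can be NaN or simply empty; either is fine. But right now my headers do not line up, and I get completely mismatched columns.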
Answer 0 (score: 0)
Your code seems to work. I just added a File_Name column and rearranged the columns.
import pandas as pd

# you can use your own files here, I'm just using these frames to test
df1 = pd.DataFrame({"header_1": [1, 2, 3, 4], "header_2": [2, 3, 4, 6]})
df2 = pd.DataFrame({"header_4": [1, 2, 3, 4], "header_3": [2, 3, 4, 5]})
globbed_files = [df1, df2]  # stands in for the list of csv files
print(globbed_files)

data = []  # pd.concat takes a list of dataframes as an argument
i = 1  # use this to set the file name counter
for csv in globbed_files:
    frame = csv
    frame["File_Name"] = "File " + str(i)  # File_Name values are set here
    data.append(frame)
    i += 1

# concat aligns columns by name; cells a file does not have become NaN
bigframe = pd.concat(data, ignore_index=True)
bigframe = bigframe.reindex(sorted(bigframe.columns), axis=1)  # this arranges your columns alphabetically
bigframe = bigframe[[col for col in bigframe.columns if col != "File_Name"] + ["File_Name"]]  # takes File_Name column to the end
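To apply the same idea to real files rather than test frames, something like the following might work. It is only a sketch under a few assumptions: the files are ordinary comma-separated CSVs, the helper name combine_csv_files and its pattern argument are made up for illustration, and the header clean-up step guesses that the mismatched columns come from headers that differ only by spaces (such as "Header 3" versus "Header_3" in the question). If the real cause is something else, for example a different delimiter, that step will not help.

```python
import glob
import os

import pandas as pd


def combine_csv_files(pattern):
    """Concatenate every CSV matching `pattern`, aligning columns by header name.

    Hypothetical helper written for this answer, not part of pandas.
    """
    frames = []
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(path)
        # Assumption: headers may differ only by stray or inner spaces,
        # so "Header 3" and " Header_3" are both treated as "Header_3".
        frame.columns = [str(col).strip().replace(" ", "_") for col in frame.columns]
        # Record which file each row came from.
        frame["File_Name"] = os.path.basename(path)
        frames.append(frame)

    if not frames:
        raise FileNotFoundError(f"no files match {pattern!r}")

    # Columns are matched by name; values a file lacks are filled with NaN.
    combined = pd.concat(frames, ignore_index=True, sort=False)

    # Sort the data columns alphabetically and keep File_Name last.
    data_cols = sorted(col for col in combined.columns if col != "File_Name")
    return combined[data_cols + ["File_Name"]]
```

Calling combine_csv_files("/filepath/*.csv") and then .to_csv("output.csv", index=False) on the result would write the combined table. Writing output.csv into a different folder than the input files is probably safer, since otherwise a second run would pick it up as one more input file.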