Combining CSV files with additional columns - and aligning the columns

Time: 2021-04-13 07:31:04

Tags: python pandas

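For the past 3 years we have received a file from a counterparty every day, so we now have more than 1,000 files. Depending on the day, each one has between 5,000 and 15,000 rows.

After some Googling and research, I tried to combine them with Python in Visual Studio Code. For testing, I took only the file from the last day of each month, 33 files in total.

The files look like this: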

File 1:

Header_1  Header_2      Header 3 
0         2             1
2         3             2 
4                       3  


File 2:     

Header_1   Header_4      Header_3  Header_2
6          4             3         1
8          5             4 
10

Desired Output
Header_1   Header_2   Header_3   Header_4 File_Name
0          2          1                   File 1
2          3          2                   File 1
4                     3                   File 1
6          1          3          4        File 2
8                     4          5        File 2
10

The code I used to try this is:

import os
import pandas as pd
import glob

#set working directory
os.chdir("/filepath/")

globbed_files = glob.glob("*.csv") #creates a list of all csv files
print(globbed_files)
data = [] # pd.concat takes a list of dataframes as an argument
for csv in globbed_files:
    frame = pd.read_csv(csv)
    data.append(frame)
    print (frame) #to check while running whether the frame was correct


bigframe = pd.concat(data, ignore_index=True, keys=globbed_files) 
bigframe.to_csv("output.csv")

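If necessary, I can drop the file name, and empty cells can be NaN or simply empty; either is fine. But right now my headers are not aligned, and I get completely mismatched columns.

1 answer:

Answer 0: (score: 0)

Your code seems to work. I just added the File_Name column and rearranged the columns.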

import pandas as pd
# you can use your own files here, I'm just using this to test
df1 = pd.DataFrame({"header_1":[1,2,3,4],"header_2": [2,3,4,6]})
df2 = pd.DataFrame({"header_4":[1,2,3,4],"header_3": [2,3,4,5]})

globbed_files = [df1, df2] # test DataFrames standing in for the list of csv files
print(globbed_files)
data = [] # pd.concat takes a list of dataframes as an argument
i=1 # use this to set the file name counter
for csv in globbed_files:
    frame = csv
    frame["File_Name"] = "File " + str(i) # File_Name values are set here
    data.append(frame)
    i+=1

bigframe = pd.concat(data, ignore_index=True) # stacks the frames; columns are matched by name
bigframe = bigframe.reindex(sorted(bigframe.columns), axis=1) # this arranges your columns alphabetically
bigframe = bigframe[ [ col for col in bigframe.columns if col != 'File_Name' ] + ['File_Name'] ] # takes File_Name column to the end
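
The snippet above uses in-memory DataFrames, so here is a rough sketch of how the same idea might look when reading real CSV files from a folder. It is only one possible approach, under a few assumptions that are not in the question: the helper name `combine_csvs` is made up, the two small files are stand-ins written to a temporary folder so the example runs on its own, the file name without its extension is used for `File_Name`, and header whitespace is stripped on the guess that stray spaces could be one cause of mismatched columns.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_csvs(folder):
    """Concatenate every CSV in `folder`, matching columns by header name.

    Hypothetical helper for illustration; not part of pandas.
    """
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        # Assumption: stray whitespace in headers may cause mismatched columns.
        frame.columns = frame.columns.str.strip()
        # Assumption: the file name without ".csv" is a good enough label.
        frame["File_Name"] = path.stem
        frames.append(frame)

    # pd.concat matches columns by label; columns missing from a file become NaN.
    combined = pd.concat(frames, ignore_index=True)

    # Sort the data columns alphabetically and keep File_Name at the end.
    ordered = sorted(col for col in combined.columns if col != "File_Name")
    return combined[ordered + ["File_Name"]]


# Stand-in files modelled on the question's example, written to a temp folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "File 1.csv").write_text(
        "Header_1,Header_2,Header_3\n0,2,1\n2,3,2\n4,,3\n"
    )
    Path(tmp, "File 2.csv").write_text(
        "Header_1,Header_4,Header_3,Header_2\n6,4,3,1\n8,5,4,\n10,,,\n"
    )
    bigframe = combine_csvs(tmp)

print(bigframe)
```

To run it on the real files, pass the actual folder to `combine_csvs` and then write the result with `bigframe.to_csv("output.csv", index=False)`. If the columns still come out mismatched, printing `frame.columns.tolist()` for each file inside the loop should show which header names actually differ.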