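Loop over Excel files and merge them on a common column in Python

Time: 2019-10-24 08:33:09

Tags: python pandas dataframe

I have a folder containing multiple CSV files. I want to loop over every file ending in .csv and merge them all into a single Excel file. Here are two examples: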

first.csv

     date    a    b
0  2019.1  1.0  NaN
1  2019.2  NaN  2.0
2  2019.3  3.0  2.0
3  2019.4  3.0  NaN

second.csv

     date    c     d
0  2019.1  1.0   NaN
1  2019.2  5.0   2.0
2  2019.3  3.0   7.0
3  2019.4  6.0   NaN
4  2019.5  NaN  10.0

...

The output I want looks like this, with the files merged on date:

        date    a     b    c    d
0  2019/1/31  1.0   NaN  1.0  NaN
1  2019/2/28  NaN   2.0  5.0  2.0
2  2019/3/31  3.0   2.0  3.0  7.0
3  2019/4/30  3.0   NaN  6.0  NaN
4  2019/5/31  NaN   NaN  NaN  10.0

I have put together the code below, but the date conversion and the part that merges the dfs are clearly wrong:

import numpy as np
import pandas as pd
import glob

dfs = pd.DataFrame()
for file_name in glob.glob("*.csv"):
    # print(file_name)
    df = pd.read_csv(file_name, engine='python', skiprows=2, encoding='utf-8')
    df = df.dropna()
    df = df.dropna(axis = 1)
    df['date'] = pd.to_datetime(df['date'], format='%Y.%m')
    ...
    dfs = pd.merge(df1, df2, on = 'date', how= "outer")

# save the data frame
writer = pd.ExcelWriter('output.xlsx')
dfs.to_excel(writer,'sheet1')
writer.save()

Please help me. Thanks.

2 answers:

Answer 0: (score: 1)

Try it like this:

import numpy as np
import pandas as pd
import glob
from pandas.tseries.offsets import MonthEnd

dfs = pd.DataFrame()
for file_name in glob.glob("*.csv"):
    df = pd.read_csv(file_name, engine='python', skiprows=2, encoding='utf-8')
    # normalise header names so every file exposes a 'date' column
    df.columns = df.columns.str.lower().str.replace('dates', 'date')
    df = df.dropna()
    df = df.dropna(axis = 1)
    # parse 'YYYY.M' as the first of the month, then move to the month's last day
    df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y.%m') + MonthEnd(1)
    # the first file seeds the result; later files are outer-merged on 'date'
    if dfs.empty:
        dfs = df.copy()
    else:
        dfs = dfs.merge(df, on='date', how="outer")
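
The merge loop can be tried without any files on disk. The following is a minimal sketch, assuming the two sample tables from the question are supplied as in-memory CSV text; `SAMPLES` and `load_frame` are hypothetical names introduced for illustration, not part of the answer. It reads `date` as a string so that a value such as `2019.10` is not collapsed to the float `2019.1`, and it leaves out the `skiprows` and `dropna` steps, which depend on the layout of the real files.

```python
import io
from functools import reduce

import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical in-memory stand-ins for first.csv and second.csv.
SAMPLES = {
    "first.csv": "date,a,b\n2019.1,1.0,\n2019.2,,2.0\n2019.3,3.0,2.0\n2019.4,3.0,\n",
    "second.csv": (
        "date,c,d\n2019.1,1.0,\n2019.2,5.0,2.0\n2019.3,3.0,7.0\n"
        "2019.4,6.0,\n2019.5,,10.0\n"
    ),
}


def load_frame(csv_text):
    """Read one CSV and convert its 'date' column to month-end timestamps."""
    df = pd.read_csv(io.StringIO(csv_text), dtype={"date": str})
    # '%Y.%m' yields the first day of the month; MonthEnd(1) moves it to the last day.
    df["date"] = pd.to_datetime(df["date"], format="%Y.%m") + MonthEnd(1)
    return df


frames = [load_frame(text) for text in SAMPLES.values()]

# Outer-merge the frames pairwise on the shared 'date' column.
merged = reduce(lambda left, right: left.merge(right, on="date", how="outer"), frames)
merged = merged.sort_values("date").reset_index(drop=True)
print(merged)
```

To write the result to a workbook, `merged.to_excel("output.xlsx", sheet_name="sheet1", index=False)` should work wherever an Excel engine such as openpyxl is installed. The `writer.save()` call used in the question was removed in pandas 2.0, so it is best avoided on current versions.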

Answer 1: (score: 1)

Use concat with a DatetimeIndex created by the index_col and parse_dates parameters of read_csv, both set to 0 so that they act on the first column of data, and finally shift each date to the last day of its month to match the expected output:

import glob
import pandas as pd

dfs = []
for file_name in glob.glob("*.csv"):
    df = pd.read_csv(file_name, engine='python', skiprows=2, encoding='utf-8',
                     index_col=0, parse_dates=[0])
    # if necessary some processing
    dfs.append(df)

df = pd.concat(dfs, axis=1)
df.index = df.index + pd.offsets.MonthEnd()
print (df)

              a    b    c     d
date
2019-01-31  1.0  NaN  1.0   NaN
2019-02-28  NaN  2.0  5.0   2.0
2019-03-31  3.0  2.0  3.0   7.0
2019-04-30  3.0  NaN  6.0   NaN
2019-05-31  NaN  NaN  NaN  10.0
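
The concat approach can be sketched the same way on in-memory data. This is a minimal example under stated assumptions: `SAMPLES` is a hypothetical stand-in for the two files in the question, and the dates are parsed with an explicit `format` instead of `parse_dates`, which is a substitution made here so the `YYYY.M` layout is not left to inference.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for first.csv and second.csv.
SAMPLES = [
    "date,a,b\n2019.1,1.0,\n2019.2,,2.0\n2019.3,3.0,2.0\n2019.4,3.0,\n",
    "date,c,d\n2019.1,1.0,\n2019.2,5.0,2.0\n2019.3,3.0,7.0\n2019.4,6.0,\n2019.5,,10.0\n",
]

frames = []
for text in SAMPLES:
    # Keep 'date' as text so '2019.10' would not be read as the float 2019.1.
    df = pd.read_csv(io.StringIO(text), dtype={"date": str})
    df["date"] = pd.to_datetime(df["date"], format="%Y.%m")
    frames.append(df.set_index("date"))

# axis=1 places the frames side by side, aligning rows on the union of the dates.
combined = pd.concat(frames, axis=1).sort_index()

# Every index value is the first of a month, so MonthEnd() moves it to that month's end.
combined.index = combined.index + pd.offsets.MonthEnd()
print(combined)
```

Compared with repeated merges, a single `concat` along `axis=1` aligns all frames in one step, at the cost of requiring the dates to be unique within each file.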
