Concatenating a large number of CSV files (30,000) in Python Pandas

Date: 2015-11-09 12:19:06

Tags: python csv pandas

I use the following function to concatenate a large number of CSV files:

import pandas as pd

def concatenate():
    files = sort()  # input is an array of filenames
    merged = pd.DataFrame()
    for file in files:
        print("concatenating " + file)
        if file.endswith('FulltimeSimpleOpt.csv'):  # only consider those filenames
            filenamearray = file.split("_")
            f = pd.read_csv(file, index_col=0)
            # derive metadata columns from the filename
            f.loc[:, 'Vehicle'] = filenamearray[0].replace("veh", "")
            f.loc[:, 'Year'] = filenamearray[1].replace("year", "")
            if "timelimit" in file:
                f.loc[:, 'Timelimit'] = "1"
            else:
                f.loc[:, 'Timelimit'] = "0"
            merged = pd.concat([merged, f], axis=0)
    merged.to_csv('merged.csv')

The problem with this function is that it does not handle a large number of files (30,000) well. I tried it with a sample of 100 files, and those completed correctly. With 30,000 files, however, the script slows down at some point and eventually crashes.

How can I handle a large number of files better in Python Pandas?

1 Answer:

Answer 0 (score: 7):

First build a list of dfs, then concatenate them once:


What you are doing is incrementally growing your df by repeated concatenation, which copies all previously accumulated data on every iteration. Building a list of dfs and then concatenating them all in one call is far more efficient.