Question

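Suppose I have several CSV files, each containing time series data indexed by date. Is there a way to create a single DataFrame containing all the data, with the index adjusting for new dates that may not have appeared in earlier files? For example, say I read in time series 1: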

03/01/2001  2.984
04/01/2001  3.016
05/01/2001  2.891
08/01/2001  2.527
09/01/2001  2.445
11/01/2001  2.648
12/01/2001  2.803
15/01/2001  2.943

The DataFrame looks much like the data above. But if I then read in another file, say time series 2:

02/01/2001  24.75
03/01/2001  24.35
04/01/2001  25.1
08/01/2001  23.5
09/01/2001  23.6
10/01/2001  24.5
11/01/2001  24.7
12/01/2001  24.4

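You can see that time series 1 has a value for 05/01/2001 while time series 2 does not. Time series 2 also contains data points for 02/01/2001 and 10/01/2001. So is there a way to end up with the following: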

02/01/2001  null    24.75 ..etc
03/01/2001  2.984   24.35 ..etc
04/01/2001  3.016   25.1  ..etc
05/01/2001  2.891   null  ..etc
08/01/2001  2.527   23.5  ..etc
09/01/2001  2.445   23.6  ..etc
10/01/2001  null    24.5  ..etc
11/01/2001  2.648   24.7  ..etc
12/01/2001  2.803   24.4  ..etc
15/01/2001  2.943   null  ..etc

where the index is adjusted for the new dates, and any time series with no data on a given day is set to null or some such value?

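My code so far is fairly basic: I can walk through a directory, open the .csv files, and get them ready to go into DataFrames, but I don't know how to combine the DataFrames in the way described above.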

    def getTimeseriesData(DataPath,columnNum,startDate,endDate):
        #print('startDate: ',startDate,' endDate: ',endDate)
        colNames = ['date']

        path = DataPath
        print('DataPath: ',DataPath)
        filePath = path, "*.csv"
        allfiles = glob.glob(os.path.join(path, "*.csv"))
        for fname in allfiles:
            name = os.path.splitext(fname)[0]
            name = os.path.split(name)[1]

            colNames.append(name)

        dataframes = [pd.read_csv(fname, header=None,usecols=[0,columnNum]) for fname in allfiles]
#not sure of the next bit

Any help is greatly appreciated.

Many thanks

Answer 1

pd.concat can be used to concatenate DataFrames with different indexes. For example,

df1 = pd.DataFrame({'A': list('ABCDE')}, index=range(5))
df2 = pd.DataFrame({'B': list('ABCDE')}, index=range(2,7))
pd.concat([df1, df2], axis=1)

yields

     A    B
0    A  NaN
1    B  NaN
2    C    A
3    D    B
4    E    C
5  NaN    D
6  NaN    E

Note that the indexes of df1 and df2 are aligned, and NaN is used wherever a value is missing.

So in your case, if you use

pd.read_csv(fname, header=None, usecols=[0, column_num], parse_dates=[0],
            dayfirst=True, index_col=[0], names=['date', name])

then index_col=[0] makes the first column the index of the DataFrame, so that a later call to

dfs = pd.concat(dfs, axis=1)

produces a single DataFrame in which all the DataFrames are aligned on their dates.

With data1.csv and data2.csv placed in ~/tmp,

import glob
import os
import pandas as pd

def get_timeseries_data(path, column_num):
    colNames = ['date']
    dfs = []
    # sort so the column order does not depend on the filesystem's listing order
    allfiles = sorted(glob.glob(os.path.join(path, "*.csv")))
    for fname in allfiles:
        name = os.path.splitext(fname)[0]
        name = os.path.split(name)[1]
        colNames.append(name)
        df = pd.read_csv(fname, header=None, usecols=[0, column_num], 
                        parse_dates=[0], dayfirst=True,
                        index_col=[0], names=['date', name])

        # aggregate rows with duplicate index by taking the mean
        df = df.groupby(level=0).agg('mean')

        # alternatively, drop rows with duplicate index
        # http://stackoverflow.com/a/34297689/190597 (n8yoder)
        # df = df[~df.index.duplicated(keep='first')]

        dfs.append(df)
    dfs = pd.concat(dfs, axis=1)
    return dfs

path = os.path.expanduser('~/tmp')
column_num = 1
dfs = get_timeseries_data(path, column_num)
print(dfs)

yields

            data1  data2
date                    
2001-01-02    NaN  24.75
2001-01-03  2.984  24.35
2001-01-04  3.016  25.10
2001-01-05  2.891    NaN
2001-01-08  2.527  23.50
2001-01-09  2.445  23.60
2001-01-10    NaN  24.50
2001-01-11  2.648  24.70
2001-01-12  2.803  24.40
2001-01-15  2.943    NaN
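
For readers who want to try the alignment without creating files on disk, here is a minimal sketch that feeds the question's sample data through io.StringIO. The names CSV1, CSV2 and read_series are hypothetical and introduced only for this illustration; it also assumes the files are comma-separated with no header row, which the question does not state explicitly.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for data1.csv and data2.csv
# (assumed comma-separated, no header row).
CSV1 = """03/01/2001,2.984
04/01/2001,3.016
05/01/2001,2.891
08/01/2001,2.527
09/01/2001,2.445
11/01/2001,2.648
12/01/2001,2.803
15/01/2001,2.943
"""
CSV2 = """02/01/2001,24.75
03/01/2001,24.35
04/01/2001,25.1
08/01/2001,23.5
09/01/2001,23.6
10/01/2001,24.5
11/01/2001,24.7
12/01/2001,24.4
"""

def read_series(text, name):
    """Read one two-column CSV into a DataFrame indexed by date."""
    df = pd.read_csv(io.StringIO(text), header=None, names=['date', name])
    # The dates are day-first (dd/mm/yyyy), so give the format explicitly.
    df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
    return df.set_index('date')

# Outer-join the two frames on their date index; dates missing from one
# file become NaN in that file's column.
combined = pd.concat(
    [read_series(CSV1, 'data1'), read_series(CSV2, 'data2')], axis=1
).sort_index()  # explicit sort, so the row order does not rely on concat

print(combined)
```

Passing an explicit format to pd.to_datetime is one way to avoid the day/month ambiguity; dayfirst=True in read_csv, as used in the function above, serves the same purpose.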

Answer 2

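Perhaps not the most elegant approach, but I would create a time series index running from the minimum date to the maximum date across all the csv files, call that DataFrame, say, df, and then do df['file1'] = pd.read_csv('file1.csv'). You will then have some rows that are entirely NaN, which you can filter out and drop.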
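
A rough sketch of this idea follows, under a few assumptions: the CSV contents are shortened, hypothetical in-memory stand-ins for file1.csv and file2.csv, and each file is read into a Series indexed by date first, since the column assignment only aligns on dates when the right-hand side carries a date index. The helper read_series is a name made up for this example.

```python
import io
import pandas as pd

# Hypothetical, shortened stand-ins for file1.csv and file2.csv.
FILES = {
    'file1': "03/01/2001,2.984\n05/01/2001,2.891\n08/01/2001,2.527\n",
    'file2': "02/01/2001,24.75\n03/01/2001,24.35\n10/01/2001,24.5\n",
}

def read_series(text):
    """Read a two-column CSV into a Series indexed by (day-first) date."""
    frame = pd.read_csv(io.StringIO(text), header=None, names=['date', 'value'])
    frame['date'] = pd.to_datetime(frame['date'], format='%d/%m/%Y')
    return frame.set_index('date')['value']

series = {name: read_series(text) for name, text in FILES.items()}

# Build a daily index spanning the earliest to the latest date in any file.
start = min(s.index.min() for s in series.values())
end = max(s.index.max() for s in series.values())
df = pd.DataFrame(index=pd.date_range(start, end, freq='D'))

# Assigning a Series aligns it on the index; dates it lacks become NaN.
for name, s in series.items():
    df[name] = s

# Drop the days on which no file has a value.
df = df.dropna(how='all')
print(df)
```

Compared with pd.concat, this builds every calendar day first and then discards the empty ones, so it does more work but makes the full date range explicit.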

Answer 3

Try something like this using merge.

df1 = pd.DataFrame([['03/01/2001', 2.984],['04/01/2001', 3.016],['05/01/2001',2.891],['08/01/2001', 2.527],
       ['09/01/2001', 2.445],['11/01/2001',2.648],
       ['12/01/2001', 2.803],['15/01/2001',2.943]], columns = ['date','field'])

df2 = pd.DataFrame([['02/01/2001',  24.75],['03/01/2001',  24.35],['04/01/2001', 25.1],['08/01/2001',  23.5],
       ['09/01/2001',  23.6], ['10/01/2001',  24.5],['11/01/2001',  24.7],['12/01/2001',  24.4]], columns = ['date','field'])

#files in your directory
files= [df1,df2]

fileNo = 1
for currFile in files:
    if fileNo ==1:
        df = currFile
    else:
        # rename returns a new DataFrame, so the result must be assigned back
        currFile = currFile.rename(columns = {'field':'field_fromFile_' + str(fileNo)})
        df = pd.merge(df, currFile, how = 'outer',left_on = 'date',right_on = 'date')
    fileNo =fileNo + 1
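
The same outer merge can also be expressed as a fold over the list of frames. The sketch below is one possible variation, not the answer's own code: it uses shortened sample data, renames the value column of every frame (including the first), and parses the date strings so the result can be sorted chronologically. The names frames and merged are chosen for this example only.

```python
from functools import reduce
import pandas as pd

# Shortened sample data in the same shape as df1 and df2 above.
df1 = pd.DataFrame({'date': ['03/01/2001', '04/01/2001', '05/01/2001'],
                    'field': [2.984, 3.016, 2.891]})
df2 = pd.DataFrame({'date': ['02/01/2001', '03/01/2001', '04/01/2001'],
                    'field': [24.75, 24.35, 25.1]})

frames = []
for file_no, frame in enumerate([df1, df2], start=1):
    # Give each value column a unique name so merged columns do not clash.
    frame = frame.rename(columns={'field': f'field_fromFile_{file_no}'})
    # Parse the day-first date strings into real timestamps.
    frame['date'] = pd.to_datetime(frame['date'], format='%d/%m/%Y')
    frames.append(frame)

# Outer-merge the frames pairwise on the shared 'date' column.
merged = reduce(
    lambda left, right: pd.merge(left, right, how='outer', on='date'),
    frames,
)
merged = merged.sort_values('date').reset_index(drop=True)
print(merged)
```

Unlike the pd.concat approach, this keeps the date as an ordinary column; calling merged.set_index('date') afterwards would give a date-indexed result.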

Writing time series to a DataFrame with an adjusted index
