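Operations on columns across multiple files in Pandas

Time: 2015-06-26 13:02:11

Tags: python file csv pandas time-series

I am trying to perform some arithmetic operations in Python Pandas and merge the result into one of the files.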

Path_1: File_1.csv, File_2.csv, ....

This path contains several files, with more arriving at successive time intervals. Each file has the following columns:

    File_1.csv    |  File_2.csv
    Nos,12:00:00  |  Nos,12:30:00

    123,1451         485,5464
    656,4544         456,4865
    853,5484         658,4584

Path_2: Master_1.csv

    Nos,00:00:00
    123,2000
    485,1500
    656,1000
    853,2500
    456,4500
    658,5000

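I am trying to read the n .csv files from Path_1 and compare the col[1] header time series of each file with the col[last] time series of Master_1.csv.

If Master_1.csv does not have that time, it should create a new column named after the time series of the Path_1 .csv file and fill in the values matched on col['Nos'], subtracting them from col[1] of Master_1.csv.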

If the time column from the Path_1 file is already present, then look for NAN and replace the NAN with the subtracted value, matched on col['Nos'].

i.e.

Expected output in Master_1.csv:

    Nos,00:00:00,12:00:00,12:30:00
    123,2000,549,NAN
    485,1500,NAN,3964
    656,1000,3544,NAN
    853,2500,2984,NAN
    456,4500,NAN,365
    658,5000,NAN,-416

I understand the arithmetic, but I am not able to loop over the time series. My attempt so far:

    import pandas as pd
    import numpy as np

    path_1 = '/'
    path_2 = '/'

    df_1 = pd.read_csv(os.path_1('/.*csv'), Index=None, columns=['Nos', 'timeseries'] #times series is different in every file eg: 12:00, 12:30, 17:30 etc
    df_2 = pd.read_csv('master_1.csv', Index=None, columns=['Nos', '00:00:00']) #00:00:00 time series

    for Nos in df_1 and df_2:
        df_1['Nos'] = df_2['Nos']
        new_tseries = df_2['00:00:00'] - df_1['timeseries']

    merged.concat('master_1.csv', Index=None, columns=['Nos', '00:00:00', 'new_tseries'], axis=0) # new_timeseries is the dynamic time series that every .csv file will have from path_1

I tried to put some code together and am stuck on the loop. I need help with this. Thanks.

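1 Answer:

Answer 0: (score: 2)

You can do this in three steps:

  1. Read your csv files into a list of dataframes.
  2. Merge the dataframes together (the equivalent of a SQL left join or an Excel VLOOKUP).
  3. Compute the derived columns using a vectorized subtraction.

Here is some code you can try: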

    # read dataframes into a list
    import glob
    import pandas as pd

    L = []
    for fname in glob.glob(path_1 + '*.csv'):
        L.append(pd.read_csv(fname))

    # read master dataframe, and merge in the other dataframes on 'Nos'
    df_2 = pd.read_csv('master_1.csv')
    for df in L:
        df_2 = pd.merge(df_2, df, on='Nos', how='left')

    # for each merged time column, calculate the difference with the master column
    # ('Nos' and the master column itself are left untouched)
    time_cols = df_2.columns.difference(['Nos', '00:00:00'])
    df_2[time_cols] = df_2[time_cols].sub(df_2['00:00:00'], axis=0)
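
To try the idea without touching any files, here is a minimal sketch that runs on the sample data from the question, loaded from inline strings. It is a variant rather than the exact code above: it aligns each file to the master with `Series.map` instead of `pd.merge`, so that a time column already present in the master can have only its NaN gaps filled. The names `master`, `frames` and `base` are chosen for illustration, and the sign convention (file value minus master value) is an assumption taken from the subtraction in this answer.

```python
import io
import pandas as pd

# Sample data copied from the question (inline strings instead of files on disk).
master_csv = """Nos,00:00:00
123,2000
485,1500
656,1000
853,2500
456,4500
658,5000
"""
file_1_csv = """Nos,12:00:00
123,1451
656,4544
853,5484
"""
file_2_csv = """Nos,12:30:00
485,5464
456,4865
658,4584
"""

master = pd.read_csv(io.StringIO(master_csv))
frames = [pd.read_csv(io.StringIO(text)) for text in (file_1_csv, file_2_csv)]

base = "00:00:00"  # master column used as the reference (assumed name from the question)

for df in frames:
    time_col = df.columns[1]  # header of the file's value column, e.g. "12:00:00"
    # Look up each master 'Nos' in the file; rows missing from the file become NaN.
    aligned = master["Nos"].map(df.set_index("Nos")[time_col])
    diff = aligned - master[base]  # file value minus master value (assumed sign)
    if time_col in master.columns:
        # Column already exists: fill only the NaN gaps, keep existing values.
        master[time_col] = master[time_col].fillna(diff)
    else:
        # Column is new: create it.
        master[time_col] = diff

print(master)
```

With this sign convention, every row matches the expected output in the question except `Nos` 123, which comes out as -549 where the question lists 549. To get master minus file instead, swap the operands in `diff`.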