
Time: 2016-11-24 09:14:36

Tags: python csv datetime for-loop export-to-csv


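I have a lot of csv files (more than 100). Every csv file has "Datetime" as its first column, in the format "YYYY-MM-DD HH:MM:SS". Each file provides one row of data every 15 minutes for an entire month (so a lot of rows). The csv files are located in three separate folders, with the following paths: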

"C:\Users\Documents\SummaryData\24Hour"

"C:\Users\Documents\SummaryData\Daytime"

"C:\Users\Documents\SummaryData\Nighttime"

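The csv files in the 24Hour folder span the full 24-hour time range. The csv files in the Daytime folder span 06:00 - 18:00 (HH:MM). The csv files in the Nighttime folder span 18:00 - 06:00 (HH:MM).

For example, take the month of August 2015. For this month, the 24Hour folder contains one csv file that provides 15-minute interval data for the whole of August 2015.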







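Another tricky part is that I also need to export these files as csv files (to any output location, say "C:\Users\cp_vm\Documents\Output") and have them renamed automatically to indicate how they were resampled.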







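I know I need to use some form of 'for loop', and I did try, but I could not figure this out. Any help would be greatly appreciated! Thanks to everyone who has already offered suggestions.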



2 Answers:

Answer 0: (score: 1)





I defined a sample dataframe with dummy data. Its first rows look like this:

dates               PRp         PRe         Norm_Eff    SR_Gen      SR_All
2016-01-01 00:00:00 0.269389    0.517720    0.858603    8123.746453 8770.560467
2016-01-01 00:15:00 0.283316    0.553203    0.862253    7868.675481 8130.974409
2016-01-01 00:30:00 0.286590    0.693997    0.948463    8106.217144 8314.584848

import calendar
import datetime
import random

import pandas as pd
from dateutil.relativedelta import relativedelta

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016, 1, 1, 0, 0)
r = range(0, 10000)
dates = [start + relativedelta(minutes=15 * i) for i in r]
PRp = [random.uniform(.2, .3) for i in r]
PRe = [random.uniform(0.5, .7) for i in r]
Norm_Eff = [random.uniform(0.7, 1) for i in r]
SR_Gen = [random.uniform(7500, 8500) for i in r]
SR_All = [random.uniform(8000, 9500) for i in r]

DF = pd.DataFrame({
    'dates': dates,
    'PRp': PRp,
    'PRe': PRe,
    'Norm_Eff': Norm_Eff,
    'SR_Gen': SR_Gen,
    'SR_All': SR_All,
})

# define when day starts and ends (MUST USE 24 CLOCK)
day = {
    'start': datetime.time(6, 0),   # start at 6am (6:00)
    'end': datetime.time(18, 0)     # ends at 6pm (18:00)
}

# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year
yearRange = range(min_year, max_year + 1)

# time-of-day of every row, computed once and reused for the day/night filters
times = DF.dates.dt.time

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1, 13):
        # rows of the given month; the upper bound is exclusive so that
        # the last day of the month is included
        month_start = datetime.datetime(year, month, 1)
        month_end = month_start + relativedelta(months=1)
        in_month = (DF.dates >= month_start) & (DF.dates < month_end)

        # filter to show NIGHT and DAY dataframe for given month within given year
        NIGHT = DF[in_month & ((times <= day['start']) | (times >= day['end']))]
        DAY = DF[in_month & ((times > day['start']) & (times < day['end']))]

        # resampled column must be placed in index
        NIGHT.index = NIGHT.dates
        DAY.index = DAY.dates

        # Create resampled dataframes on Hourly, Daily, Monthly basis
        # (note: recent pandas versions prefer 'h' and 'ME' over 'H' and 'M')
        for resample_freq, freq_tag in zip(['H', 'D', 'M'], ['Hourly', 'Daily', 'Monthly']):
            NIGHT_R = pd.DataFrame(data={
                'PRp': NIGHT.PRp.resample(rule=resample_freq).mean(),  # averaging data
                'PRe': NIGHT.PRe.resample(rule=resample_freq).mean(),
                'Norm_Eff': NIGHT.Norm_Eff.resample(rule=resample_freq).mean(),
                'SR_Gen': NIGHT.SR_Gen.resample(rule=resample_freq).sum(),  # summing data
                'SR_All': NIGHT.SR_All.resample(rule=resample_freq).sum()
            })
            NIGHT_R.dropna(inplace=True)  # removes the times during 'day' (which show as NA)

            DAY_R = pd.DataFrame(data={
                'PRp': DAY.PRp.resample(rule=resample_freq).mean(),
                'PRe': DAY.PRe.resample(rule=resample_freq).mean(),
                'Norm_Eff': DAY.Norm_Eff.resample(rule=resample_freq).mean(),
                'SR_Gen': DAY.SR_Gen.resample(rule=resample_freq).sum(),
                'SR_All': DAY.SR_All.resample(rule=resample_freq).sum()
            })
            DAY_R.dropna(inplace=True)  # removes the times during 'night' (which show as NA)

            # save to .csv with date and time in file name
            # specify the save path of your choice
            path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_NIGHT_{2}.csv'.format(year, calendar.month_name[month], freq_tag)
            path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_DAY_{2}.csv'.format(year, calendar.month_name[month], freq_tag)

            # some of the above NIGHT_R / DAY_R filtering will return no rows.
            # Check for this, and only save if the dataframe contains rows
            if NIGHT_R.shape[0] > 0:
                NIGHT_R.to_csv(path_night, index=True)
            if DAY_R.shape[0] > 0:
                DAY_R.to_csv(path_day, index=True)

The above will result in a total of 6 files per month:

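  1. Daytime, hourly basis
  2. Daytime, daily basis
  3. Daytime, monthly basis
  4. Nighttime, hourly basis
  5. Nighttime, daily basis
  6. Nighttime, monthly basis

Each file will have a filename as follows: (Year)(Month_Name)(Day/Night)(Frequency). For example: 2016August_NIGHT_Daily.csv

Also, here is the list of frequencies you can choose from: pandas resample documentation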
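As a possible simplification of the day/night filtering in this answer, pandas also offers `DataFrame.between_time`, which selects rows by time of day on a DatetimeIndex. The following is only a sketch under some assumptions: it needs a pandas version that supports the `inclusive` argument of `between_time` (1.4 or later), and the two-day dummy frame and its two columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute data covering two full days (96 rows per day).
idx = pd.date_range("2016-01-01", periods=2 * 96, freq="15min")
df = pd.DataFrame(
    {"PRp": np.linspace(0.2, 0.3, len(idx)), "SR_Gen": np.full(len(idx), 10.0)},
    index=idx,
)

# Daytime: strictly between 06:00 and 18:00 (both endpoints excluded),
# mirroring the '>' and '<' comparisons used in the answer.
day = df.between_time("06:00", "18:00", inclusive="neither")

# Nighttime: start later than end wraps around midnight, so this keeps
# 18:00 through 06:00 with both endpoints included.
night = df.between_time("18:00", "06:00", inclusive="both")

# Average one column and sum the other, per calendar day.
agg = {"PRp": "mean", "SR_Gen": "sum"}
day_daily = day.resample("D").agg(agg)
night_daily = night.resample("D").agg(agg)
```

With this split every row lands in exactly one of the two frames, so `len(day) + len(night)` equals `len(df)`.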

Answer 1: (score: 0)


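To avoid writing out every column name, and whether the data should be averaged ('mean') or summed ('sum') over the resampling period, I manually created another Excel document that lists the column headers in row 1 and 'mean' or 'sum' beneath each header (n columns x 2 rows). I then converted this csv to a dictionary and referenced it in the resampling code. See below.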
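As a small self-contained illustration of this idea, separate from the full script that follows, the sketch below turns a two-row specification into a dictionary and hands it to `agg`. The inline csv text, the column names and the values are all invented for the example; in practice the specification would come from the manually created file.

```python
import io

import pandas as pd

# Hypothetical specification: row 1 holds the column headers,
# row 2 says how each column is aggregated.
spec_csv = io.StringIO("PRp,SR_Gen\nmean,sum\n")
spec = pd.read_csv(spec_csv)

# First data row as a dict, e.g. {'PRp': 'mean', 'SR_Gen': 'sum'}
what_to_do = spec.iloc[0].to_dict()

# Two hours of made-up 15-minute data.
idx = pd.date_range("2016-01-01", periods=8, freq="15min")
df = pd.DataFrame(
    {"PRp": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
     "SR_Gen": [1, 2, 3, 4, 5, 6, 7, 8]},
    index=idx,
)

# Each column is aggregated with the method named in the specification.
hourly = df.resample("h").agg(what_to_do)
```

Adding a new column then only means adding one header and one 'mean' or 'sum' entry to the specification, with no change to the resampling code.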


import pandas as pd
import glob

#project specific paths - comment (#) all paths not relevant

#read in manually created re-sampling csv file to reference later as a dictionary in the re-sampling code
#the file below consists of n*columns x 2 rows, where row 1 is the column headers and row 2 specifies whether that column is to be averaged ('mean') or summed ('sum') over the re-sampling time period
f = pd.read_csv('C:/Users/cp_vm/Documents/ResampleData/AllData.csv')

#convert manually created resampling csv to dictionary ({'columnname': resample, 'columnname2': resample2})
recordcol = list(f.columns.values)
#first data row holds 'mean', 'sum' (or 'min' for Datetime) for each column
recordrow = f.iloc[0,:]
what_to_do = dict(zip(recordcol, recordrow))

#this is not very efficient, but for the time being, comment (#) all paths not relevant
#meaning run the script multiple times, each time changing the inpath and outpath
#read in datafiles via their specific paths: order - AllData 24Hour, AllData DayTime, AllData NightTime
inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/24Hour/'
outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/24Hour/{0}_{1}_{2}_AllData_24Hour.csv'

#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Daytime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Daytime/{0}_{1}_{2}_AllData_Daytime.csv'

#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Nighttime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Nighttime/{0}_{1}_{2}_AllData_Nighttime.csv'

#inpath already ends with '/', so only the pattern is appended
allFiles = glob.glob(inpath + "*.csv")

#resample all incoming files to be hourly-h, daily-d, or monthly-m and export with automatic naming of files
for files_ in allFiles:
    #read in all files_
    df = pd.read_csv(files_,index_col = None, parse_dates = ['Datetime'])
    df.index = pd.to_datetime(df.Datetime)
    #change Datetime column to be numeric, so it can be resampled without being removed
    df['Datetime'] = pd.to_numeric(df['Datetime'])
    #specify year and month for automatic naming of files
    year = df.index.year[1]
    month = df.index.month[1]
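    #comment (#) irrelevant resampling, so run it three times, changing h, d and m
    #depending on the pandas version, the daily and monthly aliases may need to be written as "D" and "M" (or "ME")
    resample = "h"
    #resample = "d"
    #resample = "m"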
    #resample df based on the dictionary defined by what_to_do and resample - please note that 'Datetime' has the resampling 'min' associated to it in the manually created re-sampling csv file
    df = df.resample(resample).agg(what_to_do)
    #drop rows where all column values are non existent
    df = df.dropna(how='all')
    #change Datetime column back to datetime.datetime format
    df.Datetime = pd.to_datetime(df.Datetime)
    #make datetime column the index
    df.index = df.Datetime
    #move datetime column to the front of dataframe (without duplicating it)
    cols = [c for c in df.columns if c != 'Datetime']
    df = df[['Datetime'] + cols]
    #export all files automating their names dependent on their datetime
    #if the dataframe has any rows, then export it
    if df.shape[0] > 0:
        df.to_csv(outpath.format(year,month,resample), index=False)
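The script above has to be run several times with different paths and frequencies commented in and out. One way to avoid that, sketched here under assumptions, is to loop over the folders and the frequencies in a single pass. The helper `resample_folder`, the `FREQS` mapping, the output naming pattern and the demo data are all hypothetical and not part of the answer; "MS" (month start) is used for the monthly rule, which labels each month by its first day.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical mapping from a readable tag to a pandas resampling rule.
FREQS = {"Hourly": "h", "Daily": "D", "Monthly": "MS"}


def resample_folder(in_dir, out_dir, tag, what_to_do):
    """Resample every csv in in_dir at each frequency; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(Path(in_dir).glob("*.csv")):
        df = pd.read_csv(csv_path, parse_dates=["Datetime"], index_col="Datetime")
        if df.empty:
            continue
        stamp = df.index[0].strftime("%Y_%m")  # year and month for the file name
        for freq_name, rule in FREQS.items():
            resampler = df.resample(rule)
            out = resampler.agg(what_to_do)
            out = out[resampler.size() > 0]  # keep only bins that had source rows
            if out.empty:
                continue
            target = out_dir / f"{stamp}_{freq_name}_{tag}.csv"
            out.to_csv(target)
            written.append(target)
    return written


# Demo on throwaway data: the same made-up month is copied into all three
# folders, so the day/night split itself is not reproduced here.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    idx = pd.date_range("2015-08-01", "2015-08-31 23:45", freq="15min")
    demo = pd.DataFrame({"Datetime": idx, "PRp": 0.25, "SR_Gen": 1.0})
    names = []
    for tag in ("24Hour", "Daytime", "Nighttime"):
        (root / "in" / tag).mkdir(parents=True)
        demo.to_csv(root / "in" / tag / "august.csv", index=False)
        paths = resample_folder(root / "in" / tag, root / "out" / tag, tag,
                                {"PRp": "mean", "SR_Gen": "sum"})
        names += [p.name for p in paths]
    monthly = pd.read_csv(root / "out" / "24Hour" / "2015_08_Monthly_24Hour.csv")
```

Filtering on `resampler.size()` rather than `dropna` is a deliberate choice in this sketch: summing an empty bin gives 0 instead of NaN, so `dropna(how='all')` alone would keep empty daytime bins in a nighttime file.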