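I have spent the past few days researching this problem, but I still cannot find a suggestion that solves it.
Below is a sample of my dataframe, named "dfs". It has about 80 columns; only 4 are shown in the example below.
dfs is a large dataframe containing rows of data reported every 15 minutes over more than 12 months (i.e. 2015-08-01 00:00:00 to 2016-09-30 23:45:00). The Datetime column has a datetime dtype.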
...
...
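I want to export (write) multiple monthly csv files, each a slice of one month's data taken from the original large csv file (dfs). For each month I want to write files containing the raw data, the day data (6am to 6pm) and the night data (6pm to 6am). I would also like to automate the file names, so that each file is called dfs_%Y%m, dfs_day_%Y%m or dfs_night_%Y%m depending on the data it contains.
At the moment I am writing more than 180 lines of code to export each csv file.
For example:
I create the monthly raw, day and night files by taking the data between the datetimes listed below from the datetime column index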
dfs201508 = dfs.ix['2015-08-01 00:00:00':'2015-08-31 23:45:00']
dfs201508Day = dfsDay.ix['2015-08-01 00:00:00':'2015-08-31 23:45:00']
dfs201508Night = dfsNight.ix['2015-08-01 00:00:00':'2015-08-31 23:45:00']
Then I export these files to their respective output paths and give each one a file name
dfs201508 = dfs201508.to_csv(outputpath+"dfs201508.csv")
dfs201508Day = dfs201508Day.to_csv(outputpathDay+"dfs_day_201508.csv")
dfs201508Night = dfs201508Night.to_csv(outputpathNight+"dfs_night_201508.csv")
What I would like to write is something like this
dfs_%Y%m = dfs.ix["%Y%m"]
dfs_day_%Y%m = dfs.ix["%Y%m(between 6am-6pm)"]
dfs_night_%Y%m = dfs.ix["%Y%m(between 6pm-6am)"]
dfs_%Y%m = dfs_%Y%m.to_csv(outputpath +"dfs_%Y%m.csv")
dfs_day_%Y%m = dfs_day_%Y%m.to_csv(outputpath%day +"dfs_day_%Y%m.csv")
dfs_night_%Y%m = dfs_night_%Y%m.to_csv(outputpath%night +"dfs_night_%Y%m.csv")
Any suggestions for code that automates this process would be greatly appreciated.
Here are some links to the web pages I have researched:
https://www.youtube.com/watch?v=aeZKJGEfD7U
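Answer 0 (score: 1)
You can use a for loop to iterate over the years and months contained in dfs. In the example below I created a dummy dataframe named DF that contains only three sample columns: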
dates Egen1_kwh Egen2_kwh
2016-01-01 00:00:00 15895880 15877364
2016-01-01 00:15:00 15895880 15877364
2016-01-01 00:30:00 15895880 15877364
2016-01-01 00:45:00 15895880 15877364
2016-01-01 01:00:00 15895880 15877364
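The following code filters the main dataframe DF into smaller dataframes (NIGHT and DAY) for each month of each year and saves them as .csv files whose names correspond to their date (e.g. 2016_1_DAY and 2016_1_NIGHT for the day and night data of January 2016).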
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from random import randint
# I defined a sample dataframe with dummy data
start = datetime.datetime(2016,1,1,0,0)
dates = [start + relativedelta(minutes=15*i) for i in range(0,10000)]
Egen1_kwh = randint(15860938,15898938)
Egen2_kwh = randint(15860938,15898938)
DF = pd.DataFrame({
    'dates': dates,
    'Egen1_kwh': Egen1_kwh,
    'Egen2_kwh': Egen2_kwh,
})
# define when day starts and ends (MUST USE 24 CLOCK)
day = {
    'start': datetime.time(6,0), # start at 6am (6:00)
    'end': datetime.time(18,0) # ends at 6pm (18:00)
}
# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year
if min_year == max_year:
    yearRange = [min_year]
else:
    yearRange = range(min_year, max_year+1)
# time of day of every row, computed once and reused for each month
times = DF.dates.apply(lambda x: x.time())
# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1,13):
        # first instant of this month and of the following month
        month_start = datetime.datetime(year, month, 1)
        month_end = month_start + relativedelta(months=1)
        # half-open interval, so the whole last day of the month is included
        in_month = (DF.dates >= month_start) & (DF.dates < month_end)
        # filter to show NIGHT and DAY dataframe for given month within given year
        # (rows stamped exactly at day['start'] or day['end'] are counted as NIGHT)
        NIGHT = DF[in_month & ((times <= day['start']) | (times >= day['end']))]
        DAY = DF[in_month & ((times > day['start']) & (times < day['end']))]
        # save to .csv with date and time in file name
        # specify the save path of your choice
        path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}_{1}_NIGHT.csv'.format(year, month)
        path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}_{1}_DAY.csv'.format(year, month)
        # some of the above NIGHT / DAY filtering will return no rows.
        # Check for this, and only save if the dataframe contains rows
        if NIGHT.shape[0] > 0:
            NIGHT.to_csv(path_night, index=False)
        if DAY.shape[0] > 0:
            DAY.to_csv(path_day, index=False)
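
If the datetimes are the index of the dataframe (as the slicing in the question suggests), the same split can probably be written more compactly by grouping on the calendar month instead of looping over every year and month. The following is only a rough sketch under some assumptions: the helper name split_by_month and its two hour parameters are made up for illustration, the file names follow the dfs_%Y%m pattern from the question, and "day" is taken to mean 06:00 up to but not including 18:00, with everything else counted as night. Adjust the boundaries if your definition differs.

```python
import numpy as np
import pandas as pd


def split_by_month(df, day_start_hour=6, day_end_hour=18):
    """Map output file names to the raw, day and night slices of each month.

    Hypothetical helper. Assumes `df` has a DatetimeIndex. Day is taken as
    day_start_hour <= hour < day_end_hour; every other row counts as night.
    """
    pieces = {}
    # to_period("M") labels every row with its calendar month (e.g. 2015-08)
    for period, month_df in df.groupby(df.index.to_period("M")):
        tag = "{0}{1:02d}".format(period.year, period.month)  # e.g. "201508"
        hours = month_df.index.hour
        is_day = (hours >= day_start_hour) & (hours < day_end_hour)
        pieces["dfs_{0}.csv".format(tag)] = month_df
        pieces["dfs_day_{0}.csv".format(tag)] = month_df[is_day]
        pieces["dfs_night_{0}.csv".format(tag)] = month_df[~is_day]
    return pieces


# Dummy data standing in for dfs: 15-minute readings over two months
index = pd.date_range("2015-08-01 00:00", "2015-09-30 23:45", freq="15min")
dfs = pd.DataFrame({"Egen1_kwh": np.arange(len(index))}, index=index)

pieces = split_by_month(dfs)
for name, frame in pieces.items():
    # To write the files, replace the print with something like
    # frame.to_csv(outputpath + name), using your own output folder.
    print(name, len(frame))
```

If Datetime is still an ordinary column rather than the index, calling dfs.set_index('Datetime') first should make the sketch applicable. Months with no data never appear in the groupby, so no check for empty dataframes is needed.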