I'm new to pandas. Every week I generate a pivot table from a csv that contains several years of records.
I can't find the right way to filter the dataframe by week. This is how I do it by hand:
# Build boolean masks for the date range and filter the dataframe with them
fmindate = (df.fecha.astype( 'datetime64[ns]' ) >= pd.to_datetime( "2017-03-01" ))
fmaxdate = (df.fecha.astype( 'datetime64[ns]' ) <= pd.to_datetime( "2018-01-15" ))
dffiltered = df[ (fmindate & fmaxdate) ]

# total_secs_inTimeSerie and mean_secs_inTimeSerie are my own aggregation helpers
txt = pd.pivot_table(
    dffiltered,
    columns=[ "fecha" ],
    index=[ "org", "tipo", "estado" ],
    values=[ "destination", "time_total", "time_avg" ],
    aggfunc={ "destination": len, "time_total": total_secs_inTimeSerie,
              "time_avg": mean_secs_inTimeSerie },
    fill_value="", margins=True
)

with open(report_name, "w") as text_file:
    text_file.write( txt.to_html() )
What is the correct way to do this?
Thank you very much!
Answer 0 (score: 2)
You can use weekofyear:
rng = pd.date_range('2017-04-03', periods=6, freq='6M')
df = pd.DataFrame({'org':list('aaabbb'),
                   'estado':list('cccbbb'),
                   'destination':[4,5,4,5,5,4],
                   'time_total':[7,8,9,4,2,3],
                   'time_avg':[1,3,5,7,1,0],
                   'fecha':rng,
                   'tipo':list('aaabbb')})

df["fecha"] = df["fecha"].dt.weekofyear
print (df)
   destination estado  fecha org  time_avg  time_total tipo
0            4      c     17   a         1           7    a
1            5      c     44   a         3           8    a
2            4      c     18   a         5           9    a
3            5      b     44   b         7           4    b
4            5      b     18   b         1           2    b
5            4      b     44   b         0           3    b
def total_secs_inTimeSerie(x):
    return x.sum()

def mean_secs_inTimeSerie(x):
    return x.mean()

txt = pd.pivot_table(
    df,
    columns=[ "fecha" ],
    index=[ "org", "tipo", "estado" ],
    values=[ "destination", "time_total", "time_avg" ],
    aggfunc={ "destination": len, "time_total": total_secs_inTimeSerie,
              "time_avg": mean_secs_inTimeSerie },
    fill_value="", margins=True
)
print (txt)
                destination              time_avg                               \
fecha                    17   18   44 All       17   18        44       All
org tipo estado
a   a    c                1  1.0  1.0   3        1  5.0  3.000000  3.000000
b   b    b                   1.0  2.0   3           1.0  3.500000  2.666667
All                       1  2.0  3.0   6        1  3.0  3.333333  2.833333

                time_total
fecha                    17    18    44 All
org tipo estado
a   a    c                7   9.0   8.0  24
b   b    b                    2.0   7.0   9
All                       7  11.0  15.0  33
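Note: in newer pandas releases Series.dt.weekofyear is deprecated; Series.dt.isocalendar().week returns the same ISO week number. A minimal equivalent of the line above, assuming fecha is still the datetime column:

df["fecha"] = df["fecha"].dt.isocalendar().week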
Answer 1 (score: 0)
Maybe my explanation was not clear enough. From the source csv I need a pivot table exactly like the one I showed, but one per week over the range of dates in the source records. I finally found a solution; any improvement is welcome. Maybe it will be useful to someone else:
import os
from datetime import timedelta

import pandas as pd

def week_range( date ):
    """ Utility function. Returns the start (Sunday) and end (Saturday) dates of the week containing the given date """
    year, week, dow = date.isocalendar()
    if dow == 7:
        # The given date is already a Sunday
        start_date = date
    else:
        start_date = date - timedelta( dow )
    end_date = start_date + timedelta( 6 )
    return (start_date, end_date)
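# For example: week_range( pd.Timestamp("2017-03-01") ) (a Wednesday) returns
# (2017-02-26, 2017-03-04), i.e. the surrounding Sunday and Saturday.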
# Load the csv (sorted by date) and get the first date (fecha)
df = pd.read_csv("data.csv")
nextDate = df.head(1)
nextDate = pd.to_datetime(nextDate["fecha"].values[0])

continueNextWeek = True
while continueNextWeek:
    # Get week start and end dates for the current date
    datefrom, dateto = week_range( nextDate )
    fmindate = (df.fecha.astype( 'datetime64[ns]' ) >= datefrom )
    fmaxdate = (df.fecha.astype( 'datetime64[ns]' ) <= dateto )
    # Get a dataframe filtered by date (and by tipo)
    dfFiltered = df[ (df.tipo == "Saliente") & (fmindate & fmaxdate) ]
    # If there are records for that week, generate the pivot table
    if dfFiltered.shape[0]:
        txt = pd.pivot_table(
            dfFiltered,
            columns=[ "fecha" ],
            index=[ "org", "tipo", "estado" ],
            values=[ "destino", "duracion_total", "duracion_media" ],
            aggfunc={ "destino": len, "duracion_total": total_secs_inTimeSerie,
                      "duracion_media": mean_secs_inTimeSerie },
            fill_value="", margins=True
        )
        # Write it to disk
        report_name = "report-{}-{}.html".format(datefrom.strftime('%Y-%m-%d'), dateto.strftime('%Y-%m-%d'))
        with open(os.path.join("out", report_name), "w") as text_file:
            text_file.write( txt.to_html() )
    # Stop if the week end date is past the last record in the set
    if dateto > df.tail(1)["fecha"].astype( 'datetime64[ns]' ).values[0]:
        continueNextWeek = False
    else:
        nextDate = dateto + timedelta(1)
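A possible simplification of the loop above (only a sketch: it assumes data.csv has the same columns and that total_secs_inTimeSerie / mean_secs_inTimeSerie are the same helpers used earlier, re-declared here as placeholders): pd.Grouper can split the rows into Sunday-to-Saturday weeks directly, so no week_range helper or manual while loop is needed.

import os
import pandas as pd

def total_secs_inTimeSerie(x):   # placeholder stand-ins for the helpers used above
    return x.sum()

def mean_secs_inTimeSerie(x):
    return x.mean()

df = pd.read_csv("data.csv", parse_dates=["fecha"])
df = df[df.tipo == "Saliente"]

# "W-SAT" bins run Sunday..Saturday and are labelled with the closing Saturday,
# matching the weeks produced by week_range above
for week_end, dfweek in df.groupby(pd.Grouper(key="fecha", freq="W-SAT")):
    if dfweek.empty:
        continue                 # skip weeks with no records
    txt = pd.pivot_table(
        dfweek,
        columns=["fecha"],
        index=["org", "tipo", "estado"],
        values=["destino", "duracion_total", "duracion_media"],
        aggfunc={"destino": len, "duracion_total": total_secs_inTimeSerie,
                 "duracion_media": mean_secs_inTimeSerie},
        fill_value="", margins=True
    )
    week_start = week_end - pd.Timedelta(days=6)
    report_name = "report-{}-{}.html".format(week_start.strftime('%Y-%m-%d'),
                                             week_end.strftime('%Y-%m-%d'))
    with open(os.path.join("out", report_name), "w") as text_file:
        text_file.write(txt.to_html())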