I don't want to spend time learning pandas syntax; instead, I'd like to use my existing SQL skills to manipulate pandas DataFrames in Python.
For various reasons, my data already lives in a pandas DataFrame.
If anyone knows the answer, could you show me a simple example to get me started? Here is my guess:
import sqlalchemy
query = """
select
trunc(date, 'MM') as month
, avg(case when hour_ending > 7 and hour_ending < 24 then volume else null end) as volume_peak
, avg(case when hour_ending <= 7 or hour_ending = 24 then volume else null end) as volume_off_peak
from pandas_df
where date >= add_months(trunc(sysdate, 'DD'), -2)
group by trunc(date, 'MM')
order by trunc(date, 'MM');
"""
new_df = sqlalchemy.run_query(query)
Thanks! Sean
Answer 0 (score: 2)
In the meantime, consider learning pandas' methods for conditional logic, filtering, and aggregation, which translate directly from your SQL skills:
trunc(..., 'MM') ---> .dt.month
CASE WHEN ---> df.loc or np.where()
WHERE ---> []
GROUP BY ---> .groupby
AVG ---> .agg('mean')
Here is an example.
Random data (seeded for reproducibility)
import numpy as np
import pandas as pd
import datetime as dt
import time
epoch_time = int(time.time())
np.random.seed(101)
pandas_df = pd.DataFrame({'date': [dt.datetime.fromtimestamp(np.random.randint(1480000000, epoch_time))
                                   for _ in range(50)],
                          'hour_ending': [np.random.randint(14) for _ in range(50)],
                          'volume': abs(np.random.randn(50)*100)})
# OUTPUT CSV AND IMPORT INTO DATABASE TO TEST RESULT
pandas_df.to_csv('Output.csv', index=False)
print(pandas_df.head(10))
# date hour_ending volume
# 0 2017-01-01 20:05:19 10 56.660415
# 1 2017-09-02 00:56:27 3 79.060800
# 2 2018-01-04 09:25:05 7 23.076240
# 3 2016-11-27 23:44:55 6 102.801241
# 4 2017-01-29 12:19:55 5 88.824230
# 5 2017-04-15 15:16:09 6 214.168659
# 6 2017-09-07 08:12:45 9 97.607635
# 7 2017-12-31 15:35:36 13 141.467249
# 8 2017-04-21 23:01:44 13 156.246854
# 9 2016-12-22 09:27:49 2 67.646662
Aggregation (columns calculated before aggregating)
# CALCULATE GROUP MONTH COLUMN
pandas_df['month'] = pandas_df['date'].dt.month
# CONDITIONAL LOGIC COLUMNS: hour_ending strictly between 7 and 24
# (pandas >= 1.3 spells the open interval inclusive='neither';
#  older versions used inclusive=False)
peak_mask = pandas_df['hour_ending'].between(7, 24, inclusive='neither')
pandas_df.loc[peak_mask, 'volume_peak'] = pandas_df['volume']
pandas_df.loc[~peak_mask, 'volume_off_peak'] = pandas_df['volume']
print(pandas_df.head(10))
# date hour_ending volume month volume_peak volume_off_peak
# 0 2017-01-01 20:05:19 10 56.660415 1 56.660415 NaN
# 1 2017-09-02 00:56:27 3 79.060800 9 NaN 79.060800
# 2 2018-01-04 09:25:05 7 23.076240 1 NaN 23.076240
# 3 2016-11-27 23:44:55 6 102.801241 11 NaN 102.801241
# 4 2017-01-29 12:19:55 5 88.824230 1 NaN 88.824230
# 5 2017-04-15 15:16:09 6 214.168659 4 NaN 214.168659
# 6 2017-09-07 08:12:45 9 97.607635 9 97.607635 NaN
# 7 2017-12-31 15:35:36 13 141.467249 12 141.467249 NaN
# 8 2017-04-21 23:01:44 13 156.246854 4 156.246854 NaN
# 9 2016-12-22 09:27:49 2 67.646662 12 NaN 67.646662
# WHERE AND GROUPBY
agg_df = pandas_df[pandas_df['date'] >= (dt.datetime.today() - dt.timedelta(days=60))]\
.groupby('month')[['volume_peak', 'volume_off_peak']].agg('mean')
print(agg_df)
# volume_peak volume_off_peak
# month
# 1 62.597999 23.076240
# 11 37.775000 17.075594
# 12 141.063694 29.986261
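The same steps can also be written as one chained expression with `np.where()` (the other `CASE WHEN` option listed above). A sketch on a small hypothetical frame rather than the seeded data:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for pandas_df above
pandas_df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05 20:00', '2024-01-06 03:00', '2024-02-01 10:00']),
    'hour_ending': [10, 3, 12],
    'volume': [50.0, 80.0, 20.0],
})

# Peak hours: strictly between 7 and 24
peak = pandas_df['hour_ending'].between(7, 24, inclusive='neither')

agg_df = (pandas_df
          .assign(month=pandas_df['date'].dt.month,
                  volume_peak=np.where(peak, pandas_df['volume'], np.nan),
                  volume_off_peak=np.where(~peak, pandas_df['volume'], np.nan))
          .groupby('month')[['volume_peak', 'volume_off_peak']]
          .mean())
```

`.assign()` adds the derived columns without mutating the original frame, and `.mean()` skips the `NaN` entries just as SQL's `AVG` ignores `NULL`.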
SQL (MS Access, against the CSV exported by the code above)
SELECT Month(p.date) As month,
AVG(IIF(p.hour_ending >7 and p.hour_ending < 24, volume, NULL)) as volume_peak,
AVG(IIF(p.hour_ending <=7 or p.hour_ending = 24, volume, NULL)) as volume_off_peak
FROM csv_data p
WHERE p.Date >= DateAdd('m', -2, Date())
GROUP BY Month(p.date)
-- month peak off_peak
-- 1 62.5979990645683 23.0762401295465
-- 11 37.7750002748325 17.0755937444385
-- 12 141.063693957234 29.9862605960166
Answer 1 (score: 1)
The short answer is no: there is no direct way to use SQLAlchemy to manipulate a pandas DataFrame.
If you really want to use SQL, I think the easiest approach is to write a few helper functions that push the pandas DataFrame into a SQL table using a database driver such as SQLite.
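A minimal sketch of that helper-function idea, using an in-memory SQLite database (the `query_df` helper is hypothetical, not a pandas or SQLAlchemy API, and SQLite's dialect lacks Oracle functions like `trunc` and `add_months`, so such queries would need rewriting):

```python
import sqlite3
import pandas as pd

def query_df(df, query, table_name='pandas_df'):
    """Load a DataFrame into an in-memory SQLite table and run a SQL query on it."""
    conn = sqlite3.connect(':memory:')
    try:
        df.to_sql(table_name, conn, index=False)
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()

# Usage with toy data
df = pd.DataFrame({'hour_ending': [3, 10], 'volume': [50.0, 80.0]})
result = query_df(df, "SELECT AVG(volume) AS avg_volume FROM pandas_df")
```

This round-trips every query through a database, so it is convenient for prototyping but slow for large frames.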
However, as a developer fluent in SQL who recently had to work with some data in a pandas DataFrame, I strongly recommend just learning the pandas way of doing things. It is not declarative the way SQL is; a single SQL query often has to be broken into several steps in pandas. Still, I found that working with pandas quickly becomes easy, and the learning process is rewarding and satisfying.