Can SQLAlchemy be used to manipulate pandas DataFrames in Python?

Asked: 2018-01-22 16:12:16

Tags: python sql pandas sqlalchemy

Rather than spending time learning pandas syntax, I would like to use my existing SQL skills to manipulate pandas DataFrames in Python.

For various reasons, my data already lives in pandas DataFrames.

If anyone knows the answer, could you show me a simple example to get me started? Here is my guess:

import sqlalchemy

query = """
    select 
        trunc(date, 'MM') as month
        , avg(case when hour_ending > 7 and hour_ending < 24 then volume else null end) as volume_peak
        , avg(case when hour_ending <= 7 or hour_ending = 24 then volume else null end) as volume_off_peak
    from pandas_df
    where date >= add_months(trunc(sysdate, 'DD'), -2)
    group by trunc(date, 'MM')
    order by trunc(date, 'MM');
"""

new_df = sqlalchemy.run_query(query)

Thanks! Sean

2 answers:

Answer 0 (score: 2)

In the meantime, consider learning pandas' methods for conditional logic, filtering, and aggregation, which translate readily from your SQL skills:

  1. trunc(..., 'MM') ---> .dt.month
  2. CASE WHEN ---> df.loc or np.where()
  3. WHERE ---> []
  4. GROUP BY ---> .groupby
  5. AVG ---> .agg('mean')

  Below is an example:

    Random data (seeded for reproducibility)

    import numpy as np
    import pandas as pd
    import datetime as dt
    import time
    
    epoch_time = int(time.time())
    
    np.random.seed(101)
    pandas_df = pd.DataFrame({'date': [dt.datetime.fromtimestamp(np.random.randint(1480000000, epoch_time)) for _ in range(50)],
                              'hour_ending': [np.random.randint(14) for _ in range(50)],
                              'volume': abs(np.random.randn(50)*100)})
    
    # OUTPUT CSV AND IMPORT INTO DATABASE TO TEST RESULT
    pandas_df.to_csv('Output.csv', index=False)
    
    print(pandas_df.head(10))
    
    #                  date  hour_ending      volume
    # 0 2017-01-01 20:05:19           10   56.660415
    # 1 2017-09-02 00:56:27            3   79.060800
    # 2 2018-01-04 09:25:05            7   23.076240
    # 3 2016-11-27 23:44:55            6  102.801241
    # 4 2017-01-29 12:19:55            5   88.824230
    # 5 2017-04-15 15:16:09            6  214.168659
    # 6 2017-09-07 08:12:45            9   97.607635
    # 7 2017-12-31 15:35:36           13  141.467249
    # 8 2017-04-21 23:01:44           13  156.246854
    # 9 2016-12-22 09:27:49            2   67.646662
    

    Aggregation (computing the needed columns prior to aggregating)

    # CALCULATE GROUP MONTH COLUMN
    pandas_df['month'] = pandas_df['date'].dt.month
    
    # CONDITIONAL LOGIC COLUMNS (hour_ending strictly between 7 and 24, i.e. hours 8-23 are peak)
    peak = pandas_df['hour_ending'].between(7, 24, inclusive='neither')  # pandas >= 1.3; older versions: inclusive=False
    pandas_df.loc[peak, 'volume_peak'] = pandas_df['volume']
    pandas_df.loc[~peak, 'volume_off_peak'] = pandas_df['volume']
    
    print(pandas_df.head(10))
    #                  date  hour_ending      volume  month  volume_peak  volume_off_peak
    # 0 2017-01-01 20:05:19           10   56.660415      1    56.660415              NaN
    # 1 2017-09-02 00:56:27            3   79.060800      9          NaN        79.060800
    # 2 2018-01-04 09:25:05            7   23.076240      1          NaN        23.076240
    # 3 2016-11-27 23:44:55            6  102.801241     11          NaN       102.801241
    # 4 2017-01-29 12:19:55            5   88.824230      1          NaN        88.824230
    # 5 2017-04-15 15:16:09            6  214.168659      4          NaN       214.168659
    # 6 2017-09-07 08:12:45            9   97.607635      9    97.607635              NaN
    # 7 2017-12-31 15:35:36           13  141.467249     12   141.467249              NaN
    # 8 2017-04-21 23:01:44           13  156.246854      4   156.246854              NaN
    # 9 2016-12-22 09:27:49            2   67.646662     12          NaN        67.646662
    
    # WHERE AND GROUPBY
    agg_df = pandas_df[pandas_df['date'] >= (dt.datetime.today() - dt.timedelta(days=60))]\
                       .groupby('month')[['volume_peak', 'volume_off_peak']].agg('mean')
    
    print(agg_df)
    #        volume_peak  volume_off_peak
    # month                              
    # 1        62.597999        23.076240
    # 11       37.775000        17.075594
    # 12      141.063694        29.986261
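    The same query can also be written as a single chain, using `np.where` (item 2 above) for the CASE WHEN logic instead of intermediate `.loc` assignments. The three-row frame below is made up purely for illustration; dates are relative to today so the 60-day filter deterministically keeps two rows:

```python
import numpy as np
import pandas as pd

# Made-up sample frame for illustration
now = pd.Timestamp.today()
pandas_df = pd.DataFrame({
    'date': [now - pd.Timedelta(days=5),     # recent, peak hour
             now - pd.Timedelta(days=5),     # recent, off-peak hour
             now - pd.Timedelta(days=100)],  # too old, dropped by the WHERE filter
    'hour_ending': [10, 3, 15],
    'volume': [50.0, 80.0, 120.0],
})

# hour_ending > 7 AND hour_ending < 24; with integer hours this is between(8, 23)
peak = pandas_df['hour_ending'].between(8, 23)

agg_df = (pandas_df
          .assign(month=pandas_df['date'].dt.month,
                  volume_peak=np.where(peak, pandas_df['volume'], np.nan),
                  volume_off_peak=np.where(~peak, pandas_df['volume'], np.nan))
          [pandas_df['date'] >= now - pd.Timedelta(days=60)]    # WHERE
          .groupby('month')[['volume_peak', 'volume_off_peak']] # GROUP BY
          .mean())                                              # AVG (skips NaN)
# one remaining month: volume_peak = 50.0, volume_off_peak = 80.0
```

    The `assign` chain leaves the original frame untouched, which is closer in spirit to a SQL SELECT than mutating columns in place.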
    

    SQL (using MS Access, after importing the CSV produced by the code above)

    SELECT Month(p.date) As month, 
           AVG(IIF(p.hour_ending >7 and p.hour_ending < 24, volume, NULL)) as volume_peak,
           AVG(IIF(p.hour_ending <=7 or p.hour_ending = 24, volume, NULL)) as volume_off_peak
    FROM csv_data p
    WHERE p.Date >= DateAdd('m', -2, Date())
    GROUP BY Month(p.date)
    
    -- month                peak            off_peak
    --     1    62.5979990645683    23.0762401295465
    --    11    37.7750002748325    17.0755937444385
    --    12    141.063693957234    29.9862605960166
    

Answer 1 (score: 1)

The short answer is no: there is no direct way to use SQLAlchemy to manipulate a pandas DataFrame.

If you really want to use SQL, I think the easiest approach is to write a few helper functions that convert a pandas DataFrame into a SQL table, using a database driver such as SQLite.
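A minimal sketch of that helper-function idea, using pandas' built-in `to_sql`/`read_sql_query` with an in-memory SQLite database. The helper name `query_df` and the sample data are made up for illustration, and note that SQLite will not understand Oracle-specific functions such as `trunc` or `add_months`:

```python
import sqlite3

import pandas as pd

def query_df(df, sql, table_name='pandas_df'):
    """Copy a DataFrame into an in-memory SQLite table, run SQL against it,
    and return the result as a new DataFrame."""
    conn = sqlite3.connect(':memory:')
    try:
        df.to_sql(table_name, conn, index=False)
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()

pandas_df = pd.DataFrame({'hour_ending': [3, 10, 15],
                          'volume': [80.0, 50.0, 120.0]})
result = query_df(pandas_df,
                  "select avg(volume) as volume_peak from pandas_df where hour_ending > 7")
# volume_peak is the mean of 50.0 and 120.0, i.e. 85.0
```

Copying the frame into a database on every query is wasteful for large data, which is another reason to learn the native pandas operations.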

However, as a SQL-fluent developer who recently had to work with some data in a pandas DataFrame, I strongly recommend simply learning the pandas way of doing things. It is not as declarative as SQL; a single SQL query often has to be broken into several steps in pandas. That said, I found that working with pandas quickly becomes easy, and the learning process is rewarding and satisfying.