Speed of pandas apply on a large data set

Asked: 2016-03-28 17:31:17

Tags: python sql-server pandas lambda

I have a table in pandas with two columns, QuarterHourDimID and StartDateDimID; these columns give an ID for each date / quarter-hour pairing. For example, for January 1, 2015 at 12:15 PM, StartDateDimID would equal 1097 and QuarterHourDimID would equal 26. This is how the data I'm reading in is organized.
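
To make the ID scheme concrete, here is a quick sanity check of that example (an illustrative snippet, assuming day ID 1 maps to 2012-01-01 and quarter-hour ID 1 to 6:00 AM, the same scheme the code below uses):

from datetime import datetime, timedelta

# Assumed scheme: day ID 1 = 2012-01-01, quarter-hour ID 1 = 6:00 AM.
date = datetime(2012, 1, 1) + timedelta(days=1097 - 1)    # 2015-01-01
print(date + timedelta(hours=6, minutes=(26 - 1) * 15))   # 2015-01-01 12:15:00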

This is a large table that I'm reading via pyodbc and pandas.read_sql(), roughly 450M rows and ~60 columns, so performance is a concern.

To parse the QuarterHourDimID and StartDateDimID columns into a workable datetime index, I run an apply function on every row to create an additional datetime column.

My code reads the table in about 800 ms without the extra parsing; but when I run this apply function, it adds roughly 4 seconds to the total runtime (expected query times land in the 5.8-6 s range). The returned df is about 45K rows and 5 columns (~450 days * ~100 quarter-hour segments).

I'm hoping to rewrite what I have more efficiently, and I'd appreciate any input in the process.

Here is the code I've written so far:

import pandas as pd
from datetime import datetime, timedelta
import pyodbc

def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
        WHERE (MarketDimID = 1
        AND RecordTypeDimID = 2
        AND EstimateTypeDimID = 1
        AND DailyOrWeeklyDimID = 1
        AND RecordSequenceCodeDimID = 5
        AND ViewingTypeDimID = 4
        AND NetworkDimID = {}
        AND DemographicGroupDimID = {}
        AND QuarterHourDimID IS NOT NULL)""".format(network, demo)

    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)

    # Quarter-hour IDs 1-72 run from 6:00 AM in 15-minute steps; IDs above 72
    # fall in the midnight-to-6 AM stretch of the same calendar date.
    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73)*15)
        return date + timedelta(hours=6, minutes=(quarter_hour-1)*15)

    # StartDateDimID 1 corresponds to 2012-01-01; pre-compute each unique date.
    map_date = {}
    init_date = datetime(year=2012, month=1, day=1)
    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x)-1)

    # this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)
    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])

    return df
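
For comparison, the per-row apply could likely be replaced with vectorized column arithmetic. This is an untested sketch that assumes the same ID scheme as time_map above and operates on the df read in by the code:

import numpy as np
import pandas as pd

# Untested sketch: compute the datetime column with vectorized timedeltas
# rather than a per-row apply, mirroring time_map's logic.
days = pd.to_timedelta(df['StartDateDimID'].astype(int) - 1, unit='D')
qh = df['QuarterHourDimID'].astype(int)
minutes = pd.Series(np.where(qh > 72, (qh % 73) * 15, 360 + (qh - 1) * 15),
                    index=df.index)
df['datetime'] = pd.Timestamp('2012-01-01') + days + pd.to_timedelta(minutes, unit='m')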

1 Answer:

Answer 0 (score: 0)

Just posting an example of the datetime conversion performed in SQL rather than in pandas, along with timings for both approaches: using the code above, which averaged 6.4 s per execution, I was able to rewrite the query entirely in SQL for an average of 640 ms per execution.

Updated code:

import pandas as pd
import pyodbc

SQL_QUERY ="""
SELECT [Impressions] = MIN(naf.Impression), [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) 
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
LEFT JOIN [dbo].[TimeDim] as td
ON naf.QuarterHourDimID = td.TimeDimID
WHERE (naf.NielsenMarketDimID = 1
    AND naf.RecordTypeDimID = 2
    AND naf.AudienceEstimateTypeDimID = 1
    AND naf.DailyOrWeeklyDimID = 1
    AND naf.RecordSequenceCodeDimID = 5
    AND naf.ViewingTypeDimID = 4
    AND naf.NetworkDimID = 1278
    AND naf.DemographicGroupDimID = 3
    AND naf.QuarterHourDimID IS NOT NULL)
GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
ORDER BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) ASC
"""

%%timeit -n200
with pyodbc.connect(DB_CREDENTIALS) as cnxn:
    df = pd.read_sql(sql=SQL_QUERY,
                     con=cnxn,
                     index_col=None)

200 loops, best of 3: 613 ms per loop
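
As an aside (not part of the original answer): if the hard-coded 1278 and 3 need to stay configurable, as in the original table() function, pyodbc's ? placeholders are a safer route than str.format(). Here SQL_QUERY_PARAM is a hypothetical copy of the query above with ? in place of those two literals:

# SQL_QUERY_PARAM is assumed to be SQL_QUERY with '?' placeholders
# substituted for the hard-coded NetworkDimID and DemographicGroupDimID.
with pyodbc.connect(DB_CREDENTIALS) as cnxn:
    df = pd.read_sql(sql=SQL_QUERY_PARAM, con=cnxn,
                     index_col=None, params=(1278, 3))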