I have a table in pandas with two columns, QuarterHourDimID and StartDateDimID; these columns give an ID for each date / quarter-hour pairing. For example, for January 1st, 2015 at 12:15 PM, StartDateDimID would equal 1097 and QuarterHourDimID would equal 26. That's how the data I'm reading in is organized.
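(As a quick sanity check, those two example IDs decode as expected if you use the 2012-01-01 epoch and the same arithmetic as the time_map function in the code further down:)

from datetime import datetime, timedelta

day_one = datetime(2012, 1, 1)                    # StartDateDimID == 1
print(day_one + timedelta(days=1097 - 1))         # 2015-01-01 00:00:00
print(timedelta(hours=6, minutes=(26 - 1) * 15))  # 12:15:00, i.e. 12:15 PM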
It's a large table, roughly 450M rows and ~60 columns, that I'm reading with pyodbc and pandas.read_sql(), so performance is a concern.
To parse the QuarterHourDimID and StartDateDimID columns into a workable datetime index, I run an apply function over every row to create an additional datetime column.
My code reads the table in about 800 ms without the extra parsing; but when I run the apply function it adds roughly 4 seconds to the total runtime (the full query comes in at 5.8-6 s). The df returned is about 45K rows and 5 columns (~450 days * ~100 quarter-hour segments).
I'm hoping to rewrite this more efficiently and would appreciate any advice along the way.
Here is the code I've written so far:
import pandas as pd
from datetime import datetime, timedelta
import pyodbc

def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
             WHERE (MarketDimID = 1
                AND RecordTypeDimID = 2
                AND EstimateTypeDimID = 1
                AND DailyOrWeeklyDimID = 1
                AND RecordSequenceCodeDimID = 5
                AND ViewingTypeDimID = 4
                AND NetworkDimID = {}
                AND DemographicGroupDimID = {}
                AND QuarterHourDimID IS NOT NULL)""".format(network, demo)

    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)

    # IDs 1-72 are 15-minute offsets from 06:00; IDs above 72 are mapped to
    # (id % 73) 15-minute offsets from midnight of the same date
    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73) * 15)
        return date + timedelta(hours=6, minutes=(quarter_hour - 1) * 15)

    # StartDateDimID 1 corresponds to 2012-01-01; build a date lookup for each ID seen
    map_date = {}
    init_date = datetime(year=2012, month=1, day=1)
    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x) - 1)

    # this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)

    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])
    return df
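For reference, a vectorized version of that datetime construction would look roughly like the sketch below (not benchmarked against the real table; it just mirrors the time_map and map_date logic above using pd.to_timedelta instead of a row-wise apply):

import numpy as np
import pandas as pd

def add_datetime(df, epoch=pd.Timestamp(2012, 1, 1)):
    qh = df['QuarterHourDimID'].astype(int)
    days = df['StartDateDimID'].astype(int) - 1
    date = epoch + pd.to_timedelta(days, unit='D')
    # same rule as time_map: IDs 1-72 are 15-minute offsets from 06:00,
    # IDs above 72 are (id % 73) 15-minute offsets from midnight
    minutes = np.where(qh > 72, (qh % 73) * 15, 6 * 60 + (qh - 1) * 15)
    df['datetime'] = date + pd.to_timedelta(pd.Series(minutes, index=df.index), unit='min')
    return df

Whether that beats pushing the conversion into SQL (as in the answer below) would still need to be measured.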
Answer 0 (score: 0)
Just posting an example of doing the datetime conversion in SQL rather than in pandas, with timing: using the code above, the average was 6.4 s per execution; by rewriting it entirely in SQL I got the average down to 640 ms per execution.
Updated code:
import pandas as pd
import pyodbc
SQL_QUERY ="""
SELECT [Impressions] = MIN(naf.Impression), [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
LEFT JOIN [dbo].[TimeDim] as td
ON naf.QuarterHourDimID = td.TimeDimID
WHERE (naf.NielsenMarketDimID = 1
AND naf.RecordTypeDimID = 2
AND naf.AudienceEstimateTypeDimID = 1
AND naf.DailyOrWeeklyDimID = 1
AND naf.RecordSequenceCodeDimID = 5
AND naf.ViewingTypeDimID = 4
AND naf.NetworkDimID = 1278
AND naf.DemographicGroupDimID = 3
AND naf.QuarterHourDimID IS NOT NULL)
GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
ORDER BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) ASC
"""
%%timeit -n200
with pyodbc.connect(DB_CREDENTIALS) as cnxn:
    df = pd.read_sql(sql=SQL_QUERY,
                     con=cnxn,
                     index_col=None)
200 loops, best of 3: 613 ms per loop
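If the query needs to stay parameterized over network and demographic like the original table() function, one option (a sketch only, not benchmarked) is to replace the hard-coded 1278 and 3 with pyodbc's ? placeholders and pass the values through read_sql's params argument instead of str.format:

def table(network, demo):
    # SQL_QUERY as above, but with "naf.NetworkDimID = ?" and
    # "naf.DemographicGroupDimID = ?" in place of the hard-coded IDs
    with pyodbc.connect(DB_CREDENTIALS) as cnxn:
        return pd.read_sql(sql=SQL_QUERY, con=cnxn,
                           params=[network, demo], index_col=None)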