Pandas MemoryError with read_sql_query

Date: 2017-03-02 21:35:04

Tags: python python-2.7 pandas memory

I am trying to read 2.7 million rows into a Pandas DataFrame and I am running into memory problems (I think). The strange part is that when I monitor RAM usage on the server, Python uses at most 1.5 GB while 8 GB is free (the server has 16 GB in total). With the same setup it can easily read up to a million rows.

What is going wrong here? Since it does not use all of the available memory and works fine with fewer rows, could Python's usable memory somehow be limited?

Here is the code and some information about the setup:
Anaconda 1.4.3 with Python 2.7 (32-bit)
Windows Server with one Xeon processor and 16 GB of RAM; SQL Server on the same machine is limited to 4 GB of RAM.

The code:

import pandas as pd


def ingest_sql(connection, nrows, alldata, refresh=False):
    """Ingest the SQL query related to the data_flag.

    :param connection: open database connection passed to read_sql_query
    :param nrows: number of rows to read when alldata is not 'True'
    :param alldata: 'True' to read every row, otherwise only nrows rows
    :param refresh: currently unused
    :return: DataFrame holding the query result
    """
    # The column list, joins and filters are identical in both branches;
    # only the optional TOP clause differs.
    base_query = (
        'SELECT {top}te.evtdescr, te.Ref_Badge_ID, te.Ref_Reader_ID, '
        'tr.SITE_ID AS SiteID, tb.id AS badgeid, te.event_time_utc, '
        'te.empid, te.cardnum, te.eventid, tp.ID AS personid, tp.NAME, tb.BADGENO '
        'FROM TBL_EVENTS_HISTORY te '
        'INNER JOIN TBL_Badges tb ON te.Ref_Badge_ID = tb.ID '
        'INNER JOIN TBL_PERSONS tp ON tb.PERSONID = tp.ID '
        'INNER JOIN TBL_READERS tr ON te.Ref_Reader_ID = tr.ID '
        'WHERE empid > 0 AND eventid < 2 '
        'AND Ref_Badge_ID IS NOT NULL '
        'AND Ref_Reader_ID IS NOT NULL '
        'ORDER BY event_time_utc'
    )

    print 'alldata:', alldata

    if alldata == 'True':
        print "Reading All Data"
        query = base_query.format(top='')
    else:
        print 'Alldata is False'
        print "Reading only " + str(nrows) + " rows"
        query = base_query.format(top='TOP ' + str(nrows) + ' ')

    print query
    df = pd.read_sql_query(query, connection)
    return df
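If the full result set is too large to materialize in one call, read_sql_query can also stream it: passing a chunksize returns an iterator of smaller DataFrames instead of one big frame. A minimal sketch along those lines, where the query and connection are the same ones built above and ingest_sql_chunked plus the 100000-row chunk size are illustrative names, not part of the original code:

import pandas as pd

def ingest_sql_chunked(connection, query, chunksize=100000):
    """Read a large query in fixed-size chunks to limit peak memory use."""
    chunks = []
    # With chunksize set, read_sql_query yields DataFrames of at most
    # chunksize rows instead of building the full result in one go.
    for chunk in pd.read_sql_query(query, connection, chunksize=chunksize):
        chunks.append(chunk)
    # Concatenating still needs memory for the final frame, but it avoids
    # one huge intermediate allocation during the read itself.
    return pd.concat(chunks, ignore_index=True)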

Here is the error:

  global start_time
" the MASTER GLUE FUNCTION
pandas imported
all external packages imporated
WIC: future imported
banana phone
DRIVER={SQL Server};SERVER=10.180.10.67;DATABASE=SAFEANALYTICS;UID=safeapp;PWD=safeapp
winter is coming imported
TBL_READERS
ID
Starting:
full_run: True
date_flag is False
alldata: True
Reading All Data
Select te.evtdescr, te.Ref_Badge_ID, te.Ref_Reader_ID, tr.SITE_ID AS SiteID, tb.id AS badgeid, te.event_time_utc, te.empid, te.cardnum, te.eventid, tp.ID AS personid, tp.NAME, tb.BADGENO FROM TBL_EVENTS_HISTORY te INNER JOIN TBL_Badges tb ON te.Ref_Badge_ID = tb.ID INNER JOIN TBL_PERSONS tp ON tb.PERSONID = tp.ID INNER JOIN TBL_READERS tr ON te.Ref_Reader_ID = tr.ID WHERE empid>0 AND eventid<2 AND Ref_Badge_ID IS NOT NULL and Ref_Reader_ID IS NOT NULL ORDER BY event_time_utc
Traceback (most recent call last):
  File "C:\Transfer\Project\VARYS_DRS_02232017\Calculate_Risk.py", line 141, in <module>
    make_risk_tables(dev=args.dev,nrows_0=args.nrows_0,nrows=args.nrows,dataflag=args.data_flag,all_data=True)
  File "C:\Transfer\Project\VARYS_DRS_02232017\Calculate_Risk.py", line 35, in make_risk_tables
    WINterIsComing_with_devid.WinVarys(nrows=nrows_0,data_flag=dataflag,refresh=dev,alldata=all_data)
  File "C:\Transfer\Project\VARYS_DRS_02232017\WINterIsComing_with_devid.py", line 151, in WinVarys
    df = read_columns_into_df(data_flag, df)
  File "C:\Transfer\Project\VARYS_DRS_02232017\WINterIsComing_with_devid.py", line 112, in read_columns_into_df
    df=df.drop_duplicates()
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\util\decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3138, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\util\decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3188, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 3177, in f
    _SIZE_HINT_LIMIT))
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\algorithms.py", line 313, in factorize
    labels = table.get_labels(vals, uniques, 0, na_sentinel, True)
  File "pandas\src\hashtable_class_helper.pxi", line 839, in pandas.hashtable.PyObjectHashTable.get_labels (pandas\hashtable.c:15395)
MemoryError
[Finished in 58.3s]
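Note that the traceback shows the MemoryError being raised inside df.drop_duplicates() in read_columns_into_df, where pandas hashes every row of the full frame, not inside read_sql_query itself. A hedged variant of the chunked read sketched above deduplicates each chunk as it arrives, so no single hash table has to cover all 2.7 million raw rows; read_deduplicated is an illustrative name, not part of the project:

import pandas as pd

def read_deduplicated(connection, query, chunksize=100000):
    """Drop duplicates chunk by chunk so no hash table spans the raw result."""
    pieces = []
    for chunk in pd.read_sql_query(query, connection, chunksize=chunksize):
        pieces.append(chunk.drop_duplicates())
    # A final pass catches duplicates that straddle chunk boundaries; the
    # concatenated frame is typically much smaller once each chunk is deduplicated.
    return pd.concat(pieces, ignore_index=True).drop_duplicates()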

1 Answer:

Answer (score: 0):

As Paul suggested, upgrading from 32-bit Python 2.7 to 64-bit made it work. I'm not entirely sure why it made the difference, though a 32-bit process on Windows is limited to roughly 2 GB of address space regardless of installed RAM, which fits Python topping out around 1.5 GB. Compiling the Cython code against 64-bit Python with the Microsoft Visual C++ Compiler for Python proved difficult, so the Cython code had to be removed.
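To verify which interpreter a script is actually running under before and after such an upgrade, checking the pointer size is a quick, platform-independent test (a small verification snippet, not part of the original answer):

import struct
import sys

# Prints 32 for a 32-bit interpreter and 64 for a 64-bit one.
print(struct.calcsize('P') * 8)
print(sys.version)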