How do I release memory used by a pandas DataFrame?

Asked: 2016-08-23 12:17:10

Tags: python pandas memory

I have a really large csv file that I opened in pandas as follows:

import pandas
df = pandas.read_csv('large_txt_file.txt')

Once I do this, my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran...

del df

However, my memory usage did not drop. Is this the wrong approach to releasing the memory used by a pandas DataFrame? If it is, what is the proper way?

7 answers:

Answer 0 (score: 68)

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, the memory is available to new Python objects, but it is not free()'d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

>>> import os, psutil, numpy as np
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.memory_info()[0] / float(2 ** 20)  # resident set size, in MiB
... 
>>> usage() # initial memory usage
27.5 

>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array

>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the number of DataFrames

Python keeps our memory at a high watermark, but we can reduce the total number of DataFrames we create. When modifying your DataFrame, prefer inplace=True, so you don't create copies.
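For example, a minimal sketch (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4], 'bar': [5, 6, 7, 8]})

# Drops the column in place instead of binding a second, modified copy
df.drop(['bar'], axis=1, inplace=True)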

Another common issue is holding on to copies of previously created DataFrames in ipython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (the default is 1000).

Reducing DataFrame size

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your DataFrame. This includes strings.

While numpy has support for fixed-size strings in arrays, pandas does not (it's caused user confusion). This can make a significant difference:

>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120

You may want to avoid using string columns, or find a way of representing string data as numbers.
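One possible way to do that (a sketch, not from the original answer) is the pandas category dtype, which stores each distinct string once and keeps only a small integer code per row:

import pandas as pd

s = pd.Series(['foo', 'bar', 'baz'] * 1000000)
cat = s.astype('category')

print(s.memory_usage(deep=True))    # counts one boxed str object per row
print(cat.memory_usage(deep=True))  # typically far smaller for repetitive data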

If you have a DataFrame that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB

Viewing memory usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.

Answer 1 (score: 22)

As noted in the comments, there are some things to try: gc.collect() (@EdChum), for example, may clear stuff. At least from my experience, these things sometimes work and often don't.

However, there is one thing that always works, because it is done at the OS level, not the language level.

Suppose you have a function that creates an intermediate huge DataFrame and returns a smaller result (which might also be a DataFrame):

def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate

Then, if you invoke it in a separate process (see the sketch below), the function is executed at a different process. Once that process completes, the OS retakes all the resources it used. There's really nothing Python, pandas, or the garbage collector can do to stop that.
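A minimal sketch using the standard library's multiprocessing module (something stands for whatever input the function takes):

import multiprocessing

# Run the calculation in a throwaway worker process; once the worker
# exits, the OS reclaims all the memory it allocated.
with multiprocessing.Pool(1) as pool:
    result = pool.map(huge_intermediate_calc, [something])[0]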

Answer 2 (score: 8)

This is what solved the problem of releasing the memory for me!!!

import gc
import pandas as pd

del [[df_1, df_2]]
gc.collect()
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()

The dataframes are explicitly set to null.

Answer 3 (score: 2)

There seems to be an issue with glibc that affects memory allocation in pandas: https://github.com/pandas-dev/pandas/issues/2659

The monkey patch detailed on this issue resolved the problem for me:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys

import pandas as pd
from ctypes import cdll, CDLL
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
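A hypothetical usage pattern (the file layout is illustrative) is to import the patch module once, before any DataFrames are created:

# main.py
import monkeypatches  # applies the patch as an import side effect
import pandas as pd

df = pd.read_csv('large_txt_file.txt')
del df  # __del__ now also calls malloc_trim(0), asking glibc to return freed pages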

Answer 4 (score: 1)

del df will not release the DataFrame if there are any other references to it at the point of deletion. So you need to delete all references to it with del df to release the memory.

Therefore, all instances bound to df should be deleted to trigger garbage collection.

Use objgraph to check which objects are holding on to it.
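A minimal sketch of such a check, assuming the objgraph package is installed (drawing the graph additionally requires graphviz):

import objgraph

# List the object types with the most live instances
objgraph.show_most_common_types()

# Render what still refers to any surviving DataFrame objects
objgraph.show_backrefs(objgraph.by_type('DataFrame'), filename='df_refs.png')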

Answer 5 (score: 0)

Here is what I did to solve this problem.

I have a small application which reads large datasets into pandas DataFrames and serves them as an API. The user can then query a DataFrame by passing query parameters to the API. When the user has read in several datasets, the application clearly runs into memory usage limits.

Instead of reading the datasets into individual DataFrame variables, read them into a dictionary of DataFrames.

df_file_contents[file_name] = pd.read_csv(..)

The front end already provides an API to empty the dictionary, which calls the dictionary's clear() method. This can be customised to be called when sys.getsizeof(df_file_contents) reaches a certain size, or it can be used to delete only certain keys.

df_file_contents.clear()
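A minimal sketch of this pattern (the function and variable names are illustrative):

import pandas as pd

df_file_contents = {}

def load_dataset(file_name):
    # Cache each dataset under its file name rather than in a loose variable
    df_file_contents[file_name] = pd.read_csv(file_name)

def clear_datasets():
    # Dropping every reference lets the garbage collector reclaim the frames
    df_file_contents.clear()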

Answer 6 (score: -5)

import matplotlib.pyplot as plt
import datetime as dt
import pandas as pd
import psycopg2
import pandas.io.sql as psql

conn = psycopg2.connect("dbname='postgres' user='user' host='100.10.20.600' password='Password'")
dataframe = psql.read_sql("""SELECT * FROM "schema"."dataset_name" """, conn)

pd.read_table(filename) - From a delimited text file (like TSV)
pd.read_excel(filename) - From an Excel file
pd.read_sql(query, connection_object) - Reads from a SQL table/database
pd.read_json(json_string) - Reads from a JSON-formatted string or file
df[col] or df.col- Returns column with label col as Series
df[[col1, col2]] - Returns Columns as a new DataFrame
s.iloc[0] - Selection by position/Integer-based indexing
s.loc[0] - Selection by index/label-based indexing
df.loc[:, :] and df.iloc[:, :] - First argument represents the number of rows and the second for columns
df.ix[0:a, 0:b] - Arguments notation is same as above but returns a rows and (b-1) columns [deprecated in pandas]
df.loc[0:4,['App','Category']]

Data Cleaning

df.drop([col1, col2, col3], inplace = True, axis=1) - Remove set of column(s)
df.columns = ['a','b','c'] - Renames columns
df.isnull() - Checks for null Values, Returns Boolean DataFrame
df.isnull().any() - Returns boolean value for each column, gives True if any null value detected corresponding to that column
df.dropna() - Drops all rows that contain null values
df.dropna(axis=1) - Drops all columns that contain null values
df.fillna(x) - Replaces all null values with x
s.replace(1,'one') - Replaces all values equal to 1 with 'one'
s.replace([1,3], ['one','three']) - Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns = lambda x: x + '_1') - Mass renaming of columns
df.rename(columns = {'old_name': 'new_name'}) - Selective renaming
df.rename(index = lambda x: x + 1) - Mass renaming of index
df[new_col] = df.col1 + ', ' + df.col2 - Add two columns to create a new column in the same DataFrame

Filter / sort

df[df[col] > 0.5] - Rows where the values in col > 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] - Rows where 0.7 > col > 0.5
df.sort_values(col1) - Sorts values by col1 in ascending order
df.sort_values(col2,ascending=False) - Sorts values by col2 in descending order
df.sort_values([col1,col2],ascending=[True,False]) - Sorts values by col1 in ascending order then col2 in descending order

df.groupby(col) - Returns a groupby object for values from one column

df.groupby([col1,col2]) - Returns a groupby object values from multiple columns
df.groupby(col1)[col2].mean() - (Aggregation) Returns the mean of the values in col2, grouped by the values in col1
df.pivot_table(index=col1,values=[col2,col3],aggfunc=np.mean) - Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.apply(np.mean) - Applies a function across each column
df.apply(np.max, axis=1) - Applies a function across each row
df.applymap(lambda x: expression) - Apply the expression to each value of the DataFrame
df[col].map(lambda x: expression) - Apply the expression to each value of the column col


swapcase - Swaps the case lower/upper.
lower() / upper() - Converts strings in the Series/Index to lower / upper case.
len() - Computes String length.
strip() - Helps strip whitespace(including newline) from each string in the Series/index from both the sides.
split(' ') - Splits each string with the given pattern.
cat(sep=' ') - Concatenates the series/index elements with given separator.
get_dummies() - Returns the DataFrame with One-Hot Encoded values.
contains(pattern) - Returns Boolean True for each element if the substring contains in the element, else False.
replace(a,b) - Replaces the value a with the value b.
repeat(value) - Repeats each element with specified number of times.
count(pattern) - Returns count of appearance of pattern in each element.
startswith(pattern) / endswith(pattern) - Returns true if the element in the Series/Index starts / ends with the pattern.
find(pattern) - Returns the first position of the first occurrence of the pattern. Returns -1 if not found.
findall(pattern) - Returns a list of all occurrence of the pattern.
islower() / isupper() / isnumeric() - Checks whether all characters in each string in the Series/Index in lower / upper case / numeric or not. Returns Boolean.

df1.append(df2) OR pd.concat([df1, df2], axis=0) - Adds the rows in df1 to the end of df2 (columns should be identical)
pd.concat([df1, df2], axis=1) - Adds the columns in df1 to the end of df2 (rows should be identical)
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)

Stats

df.mean() - Returns the mean of all columns
df.corr() - Returns the correlation between columns in a DataFrame
df.count() - Returns the number of non-null values in each DataFrame column
df.max() - Returns the highest value in each column
df.min() - Returns the lowest value in each column
df.median() - Returns the median of each column
df.std() - Returns the standard deviation of each column

Date

dataframe['start_time']=pd.to_datetime(dataframe['start_time'])
dataframe['end_time']=pd.to_datetime(dataframe['end_time'])

dataframe['month']=dataframe['start_time'].dt.month_name()
dataframe['start_hour']=dataframe['start_time'].dt.hour

print(pd.merge(left, right, on='subject_id', how='left'))

dd=dd.drop(dd[dd.Sport=='Swimming'].index)

df_merge3=df_merge4.groupby(['Name','Team']).agg({'Event':'count','Medal':'count'}).reset_index().sort_values(by=['Event','Medal'],ascending=[False,True])