I have a very large csv file, which I open in pandas as follows...
import pandas
df = pandas.read_csv('large_txt_file.txt')
Once I do this, my memory usage increases by 2GB, which is expected since the file contains millions of rows. My problem comes when I need to release this memory. I ran...
del df
However, my memory usage did not drop. Is this the wrong approach to releasing the memory used by a pandas DataFrame? If so, what is the proper way?
Answer 0 (score: 68)
Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, the memory becomes available to new Python objects, but it is not free()'d back to the system (see this question).
If you stick to numeric numpy arrays, those are freed when deleted, but boxed objects are not.
>>> import os, psutil, numpy as np
>>> def usage():
...     process = psutil.Process(os.getpid())
...     # memory_info() replaces the old get_memory_info(); [0] is the RSS in bytes
...     return process.memory_info()[0] / float(2 ** 20)
...
>>> usage() # initial memory usage
27.5
>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array
>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875 # numpy frees the array, but python keeps the heap big
Python keeps our memory at its high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True so you don't create copies.
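For example (a minimal sketch; the column name is hypothetical):

# instead of creating a modified copy:
# df = df.drop(['unneeded_col'], axis=1)
# modify df in place, avoiding the temporary second frame:
df.drop(['unneeded_col'], axis=1, inplace=True)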
Another common gotcha is holding on to copies of previously created dataframes in ipython:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})
In [3]: df + 1
Out[3]:
foo
0 2
1 3
2 4
3 5
In [4]: df + 2
Out[4]:
foo
0 3
1 4
2 5
3 6
In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]:
{3: foo
0 2
1 3
2 4
3 5, 4: foo
0 3
1 4
2 5
3 6}
You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (the default is 1000).
Wherever possible, avoid using object dtypes.
>>> df.dtypes
foo float64 # 8 bytes per value
bar int64 # 8 bytes per value
baz object # at least 48 bytes per value, often more
Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.
While numpy supports fixed-size strings in arrays, pandas does not (it's caused user confusion). This can make a significant difference:
>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9
>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120
You may want to avoid using string columns, or find a way of representing string data as numbers.
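One standard way to do that in pandas is the categorical dtype, which stores each distinct string once and keeps only small integer codes per value; a minimal sketch with made-up data:

import pandas as pd

s = pd.Series(['foo', 'bar', 'baz'] * (10 ** 6))      # many repeated strings
print(s.memory_usage(deep=True))                      # object dtype: a full Python str per value
print(s.astype('category').memory_usage(deep=True))   # integer codes plus 3 unique strings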
If your dataframe contains many repeated values (NaN is very common), you can use a sparse data structure to reduce memory usage:
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo float64
dtypes: float64(1)
memory usage: 605.5 MB
>>> df1.shape
(39681584, 1)
>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN
>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo float64
dtypes: float64(1)
memory usage: 543.0 MB
You can view the memory usage (docs):
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB
As of pandas 0.17.1, you can also run df.info(memory_usage='deep') to see memory usage including object columns.
Answer 1 (score: 22)
As noted in the comments, there are some things to try: gc.collect() (@EdChum) may clear things up, for example. At least from my experience, these things sometimes work and often don't.
There is one thing that always works, however, because it is done at the OS level, not the language level.
Suppose you have a function that creates an intermediate huge DataFrame and returns a smaller result (which might also be a DataFrame):
def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate

Then, if you run it along the lines of the sketch below, the function is executed in a different process.
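A minimal sketch using the standard multiprocessing module (the exact pool invocation is an assumption):

import multiprocessing

# a one-worker pool runs huge_intermediate_calc in a separate process;
# on platforms that spawn workers (Windows/macOS), guard this with if __name__ == '__main__'
result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something])[0]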
When that process completes, the OS retakes all the resources it used. There is really nothing Python, pandas, or the garbage collector can do to stop that.
Answer 2 (score: 8)
This solved the memory-release problem for me!
import gc

del [[df_1, df_2]]     # unbinds the names df_1 and df_2
gc.collect()           # force a garbage collection pass
df_1 = pd.DataFrame()  # rebind the names to fresh, empty frames
df_2 = pd.DataFrame()

The dataframes are then explicitly reset to empty frames.
Answer 3 (score: 2)
There seems to be an issue with glibc that affects memory allocation in pandas: https://github.com/pandas-dev/pandas/issues/2659

The monkey patch detailed on this issue resolved the problem for me:
# monkeypatches.py
# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys
import pandas as pd
from ctypes import cdll, CDLL

try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)  # ask glibc to return free heap pages to the OS
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)  # trim the heap whenever a DataFrame is destroyed

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
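To apply the patch, importing the module once at startup, before any DataFrame is created, should suffice (an assumption based on the patch replacing pd.DataFrame.__del__):

import monkeypatches  # imported for its side effect only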
Answer 4 (score: 1)
<div class="whitebox_wrapper">
<div class="index_whitebox">
<div class="index_title">
title
</div>
<div class="index_image">
</div>
<div class="index_article">
<div class="index_first_cell_text_wrapper">
</div>
</div>
<div id="index_nav_container">
<a href="contact.html" class="navButton_2">Contact</a>
</div>
</div>
</div>
del df will not free the dataframe if there are any other references to df at the time of deletion. You therefore need to delete all references to it with del to release the memory.

So all the instances bound to df should be deleted to trigger garbage collection.

Use objgraph to check which objects are holding on to it.
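For illustration, a minimal sketch of the reference problem (the variable names are hypothetical):

import pandas as pd

df = pd.DataFrame({'a': range(10 ** 6)})
alias = df   # a second reference to the same object
del df       # memory is NOT released: alias still refers to the frame
del alias    # the last reference is gone; the memory can now be reclaimed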
Answer 5 (score: 0)
Here is what I did to solve this problem.

I have a small application that reads large datasets into pandas dataframes and serves them through an API. Users can then query a dataframe by passing query parameters to the API. When a user has loaded several datasets, the application obviously runs into memory limits.

Instead of reading the datasets into individual dataframe variables, read them into a dictionary of dataframes:
df_file_contents[file_name] = pd.read_csv(..)
The front end provides an API to clear the dictionary, which calls the dictionary's clear() method. It can be customised to run when sys.getsizeof(df_file_contents) reaches a certain size, or used to delete only certain keys:
df_file_contents.clear()
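A minimal sketch of this pattern (the helper names and the size limit are hypothetical):

import sys
import pandas as pd

df_file_contents = {}

def load_dataset(file_name):
    # cache each dataset under its file name
    df_file_contents[file_name] = pd.read_csv(file_name)

def clear_if_needed(limit=10 ** 6):
    # note: sys.getsizeof measures only the dict object itself,
    # not the dataframes it refers to
    if sys.getsizeof(df_file_contents) > limit:
        df_file_contents.clear()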
Answer 6 (score: -5)
import matplotlib.pyplot as plt
import datetime as dt
import pandas as pd
import psycopg2
import pandas.io.sql as psql
conn = psycopg2.connect("dbname='postgres' user='user' host='100.10.20.600' password='Password'")
dataframe = psql.read_sql("""SELECT * FROM "schema"."dataset_name" """, conn)
pd.read_table(filename) - From a delimited text file (like TSV)
pd.read_excel(filename) - From an Excel file
pd.read_sql(query, connection_object) - Reads from a SQL table/database
pd.read_json(json_string) - Reads a JSON string/file into a DataFrame
df[col] or df.col - Returns the column with label col as a Series
df[[col1, col2]] - Returns Columns as a new DataFrame
s.iloc[0] - Selection by position/Integer-based indexing
s.loc[0] - Selection by index/label-based indexing
df.loc[:, :] and df.iloc[:, :] - The first argument selects rows, the second selects columns
df.ix[0:a, 0:b] - Same argument notation as above, but returns a rows and (b-1) columns [deprecated; removed in recent pandas]
df.loc[0:4,['App','Category']]
Data Cleaning
df.drop([col1, col2, col3], inplace = True, axis=1) - Remove set of column(s)
df.columns = ['a','b','c'] - Renames columns
df.isnull() - Checks for null Values, Returns Boolean DataFrame
df.isnull().any() - Returns boolean value for each column, gives True if any null value detected corresponding to that column
df.dropna() - Drops all rows that contain null values
df.dropna(axis=1) - Drops all columns that contain null values
df.fillna(x) - Replaces all null values with x
s.replace(1,'one') - Replaces all values equal to 1 with 'one'
s.replace([1,3], ['one','three']) - Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns = lambda x: x + '_1') - Mass renaming of columns
df.rename(columns = {'old_name': 'new_name'}) - Selective renaming
df.rename(index = lambda x: x + 1) - Mass renaming of index
df[new_col] = df.col1 + ', ' + df.col2 - Add two columns to create a new column in the same DataFrame
Filter & sort
df[df[col] > 0.5] - Rows where the values in col > 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] - Rows where 0.7 > col > 0.5
df.sort_values(col1) - Sorts values by col1 in ascending order
df.sort_values(col2,ascending=False) - Sorts values by col2 in descending order
df.sort_values([col1,col2],ascending=[True,False]) - Sorts values by col1 in ascending order then col2 in descending order
df.groupby(col) - Returns a groupby object for values from one column
df.groupby([col1,col2]) - Returns a groupby object values from multiple columns
df.groupby(col1)[col2].mean() - (Aggregation) Returns the mean of the values in col2, grouped by the values in col1
df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') - Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.apply(np.mean) - Applies a function across each column
df.apply(np.max, axis=1) - Applies a function across each row
df.applymap(lambda x: expression) - Applies the expression to each value of the DataFrame
df[col].map(lambda x: expression) - Applies the expression to each value of the column col
swapcase - Swaps the case lower/upper.
lower() / upper() - Converts strings in the Series/Index to lower / upper case.
len() - Computes String length.
strip() - Helps strip whitespace(including newline) from each string in the Series/index from both the sides.
split(' ') - Splits each string with the given pattern.
cat(sep=' ') - Concatenates the series/index elements with given separator.
get_dummies() - Returns the DataFrame with One-Hot Encoded values.
contains(pattern) - Returns Boolean True for each element if the pattern is contained in the element, else False.
replace(a,b) - Replaces the value a with the value b.
repeat(value) - Repeats each element a specified number of times.
count(pattern) - Returns count of appearance of pattern in each element.
startswith(pattern) / endswith(pattern) - Returns true if the element in the Series/Index starts / ends with the pattern.
find(pattern) - Returns the first position of the first occurrence of the pattern. Returns -1 if not found.
findall(pattern) - Returns a list of all occurrence of the pattern.
islower() / isupper() / isnumeric() - Checks whether all characters in each string in the Series/Index are lower case / upper case / numeric. Returns Boolean.
df1.append(df2) OR pd.concat([df1, df2], axis=0) - Adds the rows in df1 to the end of df2 (columns should be identical)
pd.concat([df1, df2], axis=1) - Adds the columns in df1 to the end of df2 (rows should be identical)
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)
Stats
df.mean() - Returns the mean of all columns
df.corr() - Returns the correlation between columns in a DataFrame
df.count() - Returns the number of non-null values in each DataFrame column
df.max() - Returns the highest value in each column
df.min() - Returns the lowest value in each column
df.median() - Returns the median of each column
df.std() - Returns the standard deviation of each column
Date
dataframe['start_time']=pd.to_datetime(dataframe['start_time'])
dataframe['end_time']=pd.to_datetime(dataframe['end_time'])
dataframe['month']=dataframe['start_time'].dt.month_name()
dataframe['start_hour']=dataframe['start_time'].dt.hour
print(pd.merge(left, right, on='subject_id', how='left'))
dd=dd.drop(dd[dd.Sport=='Swimming'].index)
df_merge3=df_merge4.groupby(['Name','Team']).agg({'Event':'count','Medal':'count'}).reset_index().sort_values(by=['Event','Medal'],ascending=[False,True])
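As a runnable illustration of the last chained expression (the data here is made up; the column names follow the snippet above):

import pandas as pd

df_merge4 = pd.DataFrame({
    'Name':  ['A', 'A', 'B', 'B', 'C'],
    'Team':  ['X', 'X', 'Y', 'Y', 'X'],
    'Event': ['e1', 'e2', 'e1', 'e3', 'e2'],
    'Medal': ['Gold', None, 'Silver', None, 'Gold'],
})

# count events and non-null medals per (Name, Team),
# then sort by event count descending and medal count ascending
df_merge3 = (df_merge4
             .groupby(['Name', 'Team'])
             .agg({'Event': 'count', 'Medal': 'count'})
             .reset_index()
             .sort_values(by=['Event', 'Medal'], ascending=[False, True]))
print(df_merge3)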