Question

I have a computer with two disks:

110GB SSD
1TB HDD

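The SSD has about 18GB of free space.

When I run the Python code below, it "uses up" all the space on my SSD (I end up with only 1GB free). The code iterates over all the SAS files in a folder, performs a group-by operation on each one, and appends each file's result to one big DataFrame.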

import pandas as pd
import os
import datetime
import numpy as np

#The function GetDailyPricePoints does the following:
#1. Imports file
#2. Creates "price" variable
#3. Performs a group by
#4. Decode byte variables and convert salesdate to date type (if needed)

def GetDailyPricePoints(inpath,infile):
    intable = pd.read_sas(filepath_or_buffer=os.path.join(inpath,infile))

    #Create price column
    intable.loc[intable['quantity']!=0,'price'] = intable['salesvalue']/intable['quantity']
    intable['price'] = round(intable['price'].fillna(0.0),0)

    #Create outtable
    outtable = intable.groupby(["salesdate", "storecode", "price", "barcode"]).agg({'key_row':'count', 'salesvalue':'sum', 'quantity':'sum'}).reset_index().rename(columns = {'key_row':'Baskets', 'salesvalue':'Sales', 'quantity':'Quantity'})

    #Fix byte values and salesdate column
    for column in outtable:
        if not column in list(outtable.select_dtypes(include=[np.number]).columns.values): #loop non-numeric columns
            outtable[column] = outtable[column].where(outtable[column].apply(type) != bytes, outtable[column].str.decode('utf-8'))
        elif column=='salesdate': #numeric column and name is salesdate
            outtable[column] = pd.to_timedelta(outtable[column], unit='D') + pd.Timestamp('1960-1-1')

    return outtable


inpath =  r'C:\Users\admin\Desktop\Transactions'
outpath = os.path.join(os.getcwd(), 'Export')
outfile =  'DailyPricePoints'

dirs = os.listdir(inpath)
outtable = pd.DataFrame()

#loop through SAS files in folder
for file in dirs:
    if file[-9:] == '.sas7bdat':
        outtable.append(GetDailyPricePoints(inpath,file))

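I would like to understand what exactly is using the disk space. I would also like to change the path where this "temporary work" is saved to a location on the HDD.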

Answer 1

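You are copying all of your data into RAM. In this case you don't have enough of it, so Python ends up using the page file (virtual memory), which lives on disk. The only ways around this are to get more memory, or to avoid storing everything in one big DataFrame, e.g. by writing each file's result to its own pickle with outtable.to_pickle('csvfile.csv').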
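A minimal sketch of the one-pickle-per-file idea, so that only one result is held in memory at a time. The helper name save_result, the out_dir parameter, and the demo file name jan.sas7bdat are all hypothetical; in the question's loop, the frame would be the return value of GetDailyPricePoints. Pointing out_dir at a folder on the HDD puts the result files there, but it does not move the page file.

```python
import os
import tempfile

import pandas as pd


def save_result(frame, source_name, out_dir):
    """Write one per-file result to <out_dir>/<source stem>.pkl and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(source_name)[0]      # 'jan.sas7bdat' -> 'jan'
    out_path = os.path.join(out_dir, stem + ".pkl")
    frame.to_pickle(out_path)
    return out_path


# Demo with a tiny stand-in frame and a temporary folder (both assumptions).
demo_dir = tempfile.mkdtemp()
saved = save_result(pd.DataFrame({"Sales": [1.5, 2.5]}), "jan.sas7bdat", demo_dir)
restored = pd.read_pickle(saved)
```

Each pickle can later be read back individually with pd.read_pickle when it is needed.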

However, if you insist on storing everything in one large CSV, you can append to a CSV by passing a file object as the first argument:

out = open('out.csv', 'a')
outtable.to_csv(out, index = False)

Do the .to_csv() step inside the loop.
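One thing to watch for when calling .to_csv() repeatedly: by default it writes the header row on every call. The following is a sketch of one way to handle that, using the mode and header parameters of to_csv instead of a file object. The helper name append_to_csv and the demo frames are hypothetical stand-ins for the per-file results.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(frame, path):
    """Append frame to path, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    frame.to_csv(path, mode="a", header=write_header, index=False)


# Demo: two small frames appended in a loop to a file in a temporary folder.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "out.csv")
for part in (pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})):
    append_to_csv(part, demo_path)

result = pd.read_csv(demo_path)
```

Because each frame is written out and then discarded, memory use stays close to the size of a single file's result.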

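Also, a DataFrame's .append() method does not modify the DataFrame in place; it returns a new DataFrame (unlike the list method of the same name). So the last part of your code is probably not doing what you expect.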
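If a single combined DataFrame is still wanted, a common pattern is to collect the per-file results in a plain list and concatenate once at the end. This also works on pandas 2.0 and later, where DataFrame.append has been removed. The sketch below uses a hypothetical helper combine_frames and made-up frames in place of the GetDailyPricePoints output. Note that it still holds everything in memory, so it fixes the append bug, not the RAM shortage.

```python
import pandas as pd


def combine_frames(frames):
    """Concatenate an iterable of DataFrames into one, renumbering the index."""
    frames = list(frames)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Hypothetical per-file results standing in for GetDailyPricePoints output.
parts = [
    pd.DataFrame({"storecode": [1, 2], "Sales": [10.0, 20.0]}),
    pd.DataFrame({"storecode": [3], "Sales": [30.0]}),
]
combined = combine_frames(parts)
```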
