Python - Huge memory consumption while parsing folders of CSV files

Date: 2016-08-05 12:21:31

Tags: python python-3.x csv parsing memory

I have several folders, each containing many CSV files. I need to parse them to make some plots etc.

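The problem is that memory consumption is huge: it quickly grows to about 3.4 GB. I don't know what causes this behavior, and I have never run into memory problems with Python before.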

I found memory profiler, and it gave me this output:

Filename: /home/martin/PycharmProjects/readex-radar/fooVisualizer.py

Line #    Mem usage    Increment   Line Contents
================================================
   560  293.340 MiB    0.000 MiB   @profile
   561                             def getAllDataClassifiedFromFolder(measuredFuncFolderArg,
   562                                                                yLabelArg,
   563                                                                filenameArgs,
   564                                                                slideshowCreator,
   565                                                                samplesArgs=None):
   566
   567  293.340 MiB    0.000 MiB       sampleSourcesInds = {}
   568  293.340 MiB    0.000 MiB       summarySourcesAvg = {}
   569  293.340 MiB    0.000 MiB       summarySourcesAvgInds = {}
   570  293.340 MiB    0.000 MiB       summarySourcesFull = {}
   571    
   572 3342.426 MiB 3049.086 MiB       for dirpath, dirnames, filenames in os.walk(measuredFuncFolderArg, topdown=True):
   573                                     # Ignore hidden folders because of Git etc.
   574                                     # TODO may not be needed
   575  293.340 MiB -3049.086 MiB           filenames = [f for f in filenames if not f[0] == '.']
   576  293.340 MiB    0.000 MiB           dirnames[:] = [d for d in dirnames if not d[0] == '.']
   577    
   578                                     ##########################
   579                                     # Parsing one folder     #
   580                                     ##########################
   581    
   582                                     # All data from one folder (a function on a particular line)
   583  293.340 MiB    0.000 MiB           folderData = []
   584    
   585                                     # Load the parameters given in the CSV file name
   586  293.340 MiB    0.000 MiB           funcLabelArg = filenameArgs.getFuncLabel()
   587  293.340 MiB    0.000 MiB           xLabelArg = filenameArgs.getXLabel()
   588  293.340 MiB    0.000 MiB           otherUserArgs = filenameArgs.getConfigLst()
   589    
   590                                     # 'Split' the config argument into individual values
   591  293.340 MiB    0.000 MiB           keyLst = filenameArgs.getLstOfParams()
   592    
   593 3382.207 MiB 3088.867 MiB           for filename in filenames:
   594                                         # Load the data
   595 3382.207 MiB    0.000 MiB               p = LabeledCSVParser('{}/{}'.format(dirpath, filename))
   596 3382.207 MiB    0.000 MiB               p.parse()
   597 3382.207 MiB    0.000 MiB               data = p.getDicData()
   598    
   599                                         # Create and write 'samples' if they are given
   600                                         # by the 'samplesArgs' parameter

We can see that memory consumption rises sharply at lines 572 and 593. Do you know why? I guess it is os.walk...
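
A cross-check that may help here: memory_profiler can report growth from a loop body on the `for` line itself, so the figures at 572 and 593 do not necessarily point at os.walk. The standard-library tracemalloc module attributes allocations to the source line that made them. The sketch below is only an illustration; `top_allocations` and `load_rows` are hypothetical names, and `load_rows` stands in for the real parsing loop.

```python
import tracemalloc


def top_allocations(work, limit=5):
    """Run work() and return (result, stats) for the largest allocation sites."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    result = work()  # keep the result alive so its memory is still counted
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Differences are grouped by file name and line number, largest first.
    stats = after.compare_to(before, "lineno")
    return result, stats[:limit]


def load_rows():
    # Hypothetical stand-in for the parsing loop: many tuples of strings.
    return [tuple("1.5,2.5".split(",")) for _ in range(50000)]


if __name__ == "__main__":
    rows, stats = top_allocations(load_rows)
    for stat in stats:
        print(stat)
```

Wrapping the call to getAllDataClassifiedFromFolder in the same way should list the lines that actually allocate the memory.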

So, have you seen this before? If you have, could you tell me how to solve it?

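I tried adding an explicit destructor to the LabeledCSVParser object, but it had very little effect on memory. Also, there seems to be no significant memory consumption inside the parse() function.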
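
A possible reason the destructor changes so little: destroying the parser does not free data that something else still references. The sketch below assumes, as a guess about the real code, that getDicData() returns the parser's internal dictionary, which is then stored in folderData. `TinyParser` and `Block` are hypothetical stand-ins, not the real classes.

```python
import gc
import weakref


class Block(list):
    """list subclass, used only because plain lists cannot be weakly referenced."""


class TinyParser:
    # Hypothetical stand-in for LabeledCSVParser; only the ownership pattern matters.
    def __init__(self):
        self._data = {"Label 1": Block([("1", "2"), ("3", "4")])}

    def getDicData(self):
        return self._data


folderData = []
p = TinyParser()
data = p.getDicData()
folderData.append({"Data": data})

# Weak references let us observe object lifetime without keeping anything alive.
parser_ref = weakref.ref(p)
block_ref = weakref.ref(data["Label 1"])

del p, data
gc.collect()

print("parser freed:", parser_ref() is None)
print("rows freed:  ", block_ref() is None)
```

In this sketch the parser object goes away, but the parsed rows survive for as long as folderData holds the dictionary.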

So I will try to check the iteration over the files again.
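
For checking the file iteration in isolation: os.walk is a generator that yields one directory at a time, so the traversal on its own should hold only the current directory's name lists. A minimal sketch follows; `iter_csv_paths` is a hypothetical helper, and the `.csv` suffix filter is an assumption not present in the original code.

```python
import os


def iter_csv_paths(root):
    """Yield paths of non-hidden .csv files under root, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Pruning dirnames in place stops os.walk from descending into hidden folders.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in sorted(filenames):
            # Assumption: only files ending in .csv are of interest.
            if not name.startswith(".") and name.lower().endswith(".csv"):
                yield os.path.join(dirpath, name)
```

Running `sum(1 for _ in iter_csv_paths(folder))` under the profiler, without parsing anything, would show whether the traversal alone uses a noticeable amount of memory.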

Edit 1

Filename: /home/martin/PycharmProjects/readex-radar/fooVisualizer.py   

Line #    Mem usage    Increment   Line Contents
================================================
   100 3271.395 MiB    0.000 MiB       @profile
   101                                 def parse(self):
   102                                     """
   103                                     Function for parsing a 'labeled' CSV.
   104                             
   105                                     Expects a CSV of the form:
   106                             
   107                                         # Label 1
   108                                         data1, data2
   109                                         data3, data4
   110                             
   111                                         # Label 2
   112                                         data5, data6
   113                                         data7, data8
   114                                         ...
   115                             
   116                                     Labels may repeat; the obtained values
   117                                     are stored in separate lists in the __dicData dictionary,
   118                                     where their common key is the label.
   119                                     """
   120                             
   121 3271.395 MiB    0.000 MiB           currentLabel = self.__parsedFile.readline().split('#')[1].strip()
   122 3271.395 MiB    0.000 MiB           dataBlock = list()
   123 3271.395 MiB    0.000 MiB           self.__dicData[currentLabel] = list()
   124                             
   125                                     # Store the current dataBlock in __dicData under the key currentLabel
   126                                     # - values are written into this dataBlock later, thanks to the
   127                                     #   reference
   128 3271.395 MiB    0.000 MiB           self.__dicData[currentLabel].append(dataBlock)
   129                             
   130 3272.949 MiB    1.555 MiB           for row in self.__parsedFile:
   131                                         # Check whether this is a label or a data row
   132 3272.949 MiB    0.000 MiB               if row.__contains__('#'):
   133                             
   134                                             # Create a new dataBlock
   135 3271.395 MiB   -1.555 MiB                   dataBlock = list()
   136                             
   137                                             # Get the label name from the row
   138 3271.395 MiB    0.000 MiB                   tmpLabel = row.split('#')[1].strip()
   139                             
   140                                             # The label becomes 'current' - the following
   141                                             # data will be written to it
   142 3271.395 MiB    0.000 MiB                   currentLabel = tmpLabel
   143                             
   144                                             # If the label is not yet 'registered', add
   145                                             # it to __dicData as a key
   146 3271.395 MiB    0.000 MiB                   if currentLabel not in self.__dicData.keys():
   147 3271.395 MiB    0.000 MiB                       self.__dicData[currentLabel] = list()
   148                             
   149 3271.395 MiB    0.000 MiB                   self.__dicData[currentLabel].append(dataBlock)
   150                                         else:
   151                                             # Append the parsed row to the current
   152                                             # data block as a tuple
   153 3272.949 MiB    1.555 MiB                   dataBlock.append(tuple(row.strip().split(',')))
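
Although each call to parse() adds only about 1.5 MiB here, line 153 stores every value as a Python string inside a tuple, which costs far more per number than a packed float. The sketch below is a rough comparison under assumed data (10000 rows of two numeric fields); `deep_size_of_rows` is a hypothetical helper and the sizes are approximate.

```python
import sys
from array import array

# Rows shaped the way parse() stores them: tuples of strings.
rows = [("%d.5" % i, "%d.25" % i) for i in range(10000)]


def deep_size_of_rows(rows):
    """Rough size in bytes: the list, each tuple, and each string inside."""
    total = sys.getsizeof(rows)
    for tup in rows:
        total += sys.getsizeof(tup) + sum(sys.getsizeof(s) for s in tup)
    return total


as_strings = deep_size_of_rows(rows)
# The same numbers packed as C doubles in a single array.
as_doubles = sys.getsizeof(array("d", (float(s) for tup in rows for s in tup)))

print("tuples of str:", as_strings, "bytes")
print("array of double:", as_doubles, "bytes")
```

If the columns are purely numeric, converting them while parsing may shrink the retained data considerably; this does not apply to columns that hold text.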


Filename: /home/martin/PycharmProjects/readex-radar/fooVisualizer.py

Line #    Mem usage    Increment   Line Contents
================================================
   558  293.543 MiB    0.000 MiB   @profile
   559                             def getAllDataClassifiedFromFolder(measuredFuncFolderArg,
   560                                                                yLabelArg,
   561                                                                filenameArgs,
   562                                                                slideshowCreator,
   563                                                                samplesArgs=None):
   564                             
   565  293.543 MiB    0.000 MiB       sampleSourcesInds = {}
   566  293.543 MiB    0.000 MiB       summarySourcesAvg = {}
   567  293.543 MiB    0.000 MiB       summarySourcesAvgInds = {}
   568  293.543 MiB    0.000 MiB       summarySourcesFull = {}
   569                             
   570 2746.160 MiB 2452.617 MiB       for dirpath, dirnames, filenames in os.walk(measuredFuncFolderArg, topdown=True):
   571                                     # Ignore hidden folders because of Git etc.
   572                                     # TODO may not be needed
   573                                     #filenames = [f for f in filenames if not f[0] == '.']
   574                                     #dirnames[:] = [d for d in dirnames if not d[0] == '.']
   575                             
   576                                     ##########################
   577                                     # Parsing one folder     #
   578                                     ##########################
   579                             
   580                                     # All data from one folder (a function on a particular line)
   581  293.543 MiB -2452.617 MiB           folderData = []
   582                             
   583                                     # Load the parameters given in the CSV file name
   584  293.543 MiB    0.000 MiB           funcLabelArg = filenameArgs.getFuncLabel()
   585  293.543 MiB    0.000 MiB           xLabelArg = filenameArgs.getXLabel()
   586  293.543 MiB    0.000 MiB           otherUserArgs = filenameArgs.getConfigLst()
   587                             
   588                                     # 'Split' the config argument into individual values
   589  293.543 MiB    0.000 MiB           keyLst = filenameArgs.getLstOfParams()
   590                             
   591  293.543 MiB    0.000 MiB           print('nacitam data')
   592 3271.395 MiB 2977.852 MiB           for filename in filenames:
   593                                         # Load the data
   594 3271.395 MiB    0.000 MiB               p = LabeledCSVParser('{}/{}'.format(dirpath, filename))
   595 3271.395 MiB    0.000 MiB               p.parse()
   596 3271.395 MiB    0.000 MiB               data = p.getDicData()
   597                             
   598 3271.395 MiB    0.000 MiB               print('zapisuji samples')
   599                                         # Create and write 'samples' if they are given
   600                                         # by the 'samplesArgs' parameter
   601 3271.395 MiB    0.000 MiB               if samplesArgs:
   602                                             sampleSourcesInds[filename] = {}
   603                                             for sampleArg in samplesArgs:
   604                                                 prevNumOfSources = slideshowCreator.getNumOfDataSources()
   605                                                 slideshowCreator.createAndAddDataSource(data[sampleArg], 100, True, 0, 2)
   606                                                 sampleSourcesInds[filename][sampleArg] = list(range(prevNumOfSources,
   607                                                                                                     slideshowCreator.getNumOfDataSources()))
   608                             
   609                                         # Get the parameter names from the file name
   610 3271.395 MiB    0.000 MiB               args = filename[0:filename.rfind('.')].split('_')
   611                             
   612 3271.395 MiB    0.000 MiB               print('tvorim slovnik')
   613                                         # Assign the concrete values from the CSV file name
   614                                         # to the given filenameArgs parameters
   615 3271.395 MiB    0.000 MiB               d = {key: (args[i] if i < len(args) else '') for i, key in enumerate(keyLst)}
   616                             
   617                                         # Add the data loaded from the file to the dictionary
   618 3271.395 MiB    0.000 MiB               d['Data'] = data
   619                             
   620 3271.395 MiB    0.000 MiB               print('pridavam slovnik do folderData')
   621 3271.395 MiB    0.000 MiB               folderData.append(d)
   622                             
   623                                     ###############################################################
   624                                     # Split the data loaded from the folder into groups by the    #
   625                                     # optional arguments (preconditioner, schur complement...)    #
   626                                     ###############################################################
   627 2742.480 MiB -528.914 MiB           print('rozdeluji data do kategorii')
   628                                     # Stored average yLabel values over all
   629                                     # calls of the function
   630 2742.480 MiB    0.000 MiB           folderDataGroupsAvg = {}
   631                             
   632                                     # Stored yLabel values from all calls of the function
   633                                     #
   634                                     # TODO may need to be written as a source for
   635                                     # the plot of individual solver iterations
   636 2742.480 MiB    0.000 MiB           folderDataGroupsFull = {}
   637                             
   638 2745.746 MiB    3.266 MiB           for i, val in enumerate(folderData):
   639                                         # Get the values of the configuration arguments
   640                                         # and store them as a tuple
   641 2745.746 MiB    0.000 MiB               optArgsTup = tuple([str(val[arg]) for arg in otherUserArgs])
   642                             
   643                                         # If not present yet, add the tuple of configuration
   644                                         # parameters as a key of the dictionary of average
   645                                         # consumption values
   646 2745.746 MiB    0.000 MiB               if optArgsTup not in folderDataGroupsAvg:
   647 2743.320 MiB   -2.426 MiB                   folderDataGroupsAvg[optArgsTup] = {}
   648 2743.320 MiB    0.000 MiB                   folderDataGroupsFull[optArgsTup] = {}
   649                             
   650 2745.746 MiB    2.426 MiB               if folderData[i][funcLabelArg] not in folderDataGroupsAvg[optArgsTup]:
   651 2744.539 MiB   -1.207 MiB                   folderDataGroupsAvg[optArgsTup][folderData[i][funcLabelArg]] = []
   652 2744.539 MiB    0.000 MiB                   folderDataGroupsFull[optArgsTup][folderData[i][funcLabelArg]] = []
   653                             
   654                                         # Function for getting the values from Blade summary
   655                                         # that serve as yLabelArg.
   656                                         # Not written as a lambda because of the physical length of the function's code.
   657 2745.746 MiB    1.207 MiB               def getYLabelVals(ind):
   658 2745.746 MiB    0.000 MiB                   retLst = []
   659 2745.746 MiB    0.000 MiB                   for subLst in folderData[ind]['Data']['Blade summary']:
   660 2745.746 MiB    0.000 MiB                       for item in subLst:
   661 2745.746 MiB    0.000 MiB                           if item[0] == yLabelArg:
   662 2745.746 MiB    0.000 MiB                               retLst.append(float(item[1]))
   663 2745.746 MiB    0.000 MiB                   return retLst
   664                             
   665                                         # Write the values from all calls of the function for one setting
   666                                         # into folderDataGroupsFull
   667 2745.746 MiB    0.000 MiB               folderDataGroupsFull[optArgsTup][folderData[i][funcLabelArg]] \
   668 2745.746 MiB    0.000 MiB                   .append((folderData[i][xLabelArg], getYLabelVals(i)))
   669                             
   670                                         # Get the average consumption over all calls of the function for one
   671                                         # setting (Prec, Schur) and one function label
   672                                         # (number of cores...).
   673                                         #
   674                                         # ADD THESE VALUES to folderDataGroupsAvg.
   675 2745.746 MiB    0.000 MiB               folderDataGroupsAvg[optArgsTup][folderData[i][funcLabelArg]] \
   676 2745.746 MiB    0.000 MiB                   .append((folderData[i][xLabelArg], numpy.mean(getYLabelVals(i))))
   677                             
   678 2745.746 MiB    0.000 MiB           print('zapidu folderDataGroupsAvg jako zdroj')
   679                                     # Get the data from folderDataGroupsAvg and write them as sources
   680 2746.160 MiB    0.414 MiB           for key, vals in sorted(folderDataGroupsAvg.items()):
   681 2746.160 MiB    0.000 MiB               summarySourcesAvg[key] = {}
   682                             
   683                                         # TODO consider whether it would be better to merge summarySourcesAvg and summarySourcesAvgInds
   684                                         # into one dictionary
   685 2746.160 MiB    0.000 MiB               summarySourcesAvgInds[key] = {}
   686                             
   687 2746.160 MiB    0.000 MiB               for subKey, val in sorted(vals.items()):
   688                                             # Write the sources for the given configuration into the list - for percentage calculations etc.
   689 2746.160 MiB    0.000 MiB                   summarySourcesAvg[key][subKey] = val
   690                             
   691                                             # Write the data into the sources for plotting
   692 2746.160 MiB    0.000 MiB                   summarySourcesAvgInds[key][subKey] = slideshowCreator.getNumOfDataSources()
   693 2746.160 MiB    0.000 MiB                   slideshowCreator.createAndAddDataSourcesTexCode([sorted(val)], 0, False, 0, 1)
   694                             
   695 2746.160 MiB    0.000 MiB           print('ziskam data z folderDataGroupsFull')
   696                                     # Get the data from folderDataGroupsFull
   697                                     #
   698                                     # TODO may also need to write sources for
   699                                     # the plots of individual iterations
   700 2746.160 MiB    0.000 MiB           for key, vals in sorted(folderDataGroupsFull.items()):
   701 2746.160 MiB    0.000 MiB               summarySourcesFull[key] = {}
   702                             
   703 2746.160 MiB    0.000 MiB               for subKey, val in sorted(vals.items()):
   704                                             # Write the sources for the given configuration into the list - for percentage calculations etc.
   705 2746.160 MiB    0.000 MiB                   summarySourcesFull[key][subKey] = val
   706                             
   707 2746.160 MiB    0.000 MiB       return summarySourcesAvg, sampleSourcesInds, summarySourcesAvgInds, summarySourcesFull
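
In this second profile the memory stays around 2.7 GB after the file loop, which would fit folderData keeping every parsed row of every file alive, while getYLabelVals only reads the 'Blade summary' rows whose first field equals yLabelArg. A minimal sketch of reading just those values while streaming the file is given below. `read_label_values` is a hypothetical helper, it assumes the `# Label` / `key,value` layout described in the parse() docstring, and unlike the original it strips whitespace around fields and skips blank lines.

```python
def read_label_values(path, wanted_label, wanted_key):
    """Return float values of `wanted_key,value` rows under `# wanted_label` blocks."""
    values = []
    current_label = None
    with open(path, newline="") as fh:  # the with-block closes the file promptly
        for row in fh:
            row = row.strip()
            if not row:
                continue
            if "#" in row:
                # A label line starts a new block; remember which one we are in.
                current_label = row.split("#", 1)[1].strip()
            elif current_label == wanted_label:
                fields = [f.strip() for f in row.split(",")]
                if len(fields) >= 2 and fields[0] == wanted_key:
                    values.append(float(fields[1]))
    return values
```

With something like this, each file would contribute only a short list of floats to folderData. Whether that is enough depends on whether the data selected by samplesArgs is also needed.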

0 answers:

No answers