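I have several folders, each containing many CSV files. I need to parse them to make some plots, etc.
The problem is that memory consumption is huge: it quickly grows to roughly 3.4 GB. I don't know what causes this behavior; I have never run into memory problems with Python before.
I found memory_profiler, which gave me this output: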
Filename: /home/martin/PycharmProjects/readex-radar/fooVisualizer.py
Line # Mem usage Increment Line Contents
================================================
560 293.340 MiB 0.000 MiB @profile
561 def getAllDataClassifiedFromFolder(measuredFuncFolderArg,
562 yLabelArg,
563 filenameArgs,
564 slideshowCreator,
565 samplesArgs=None):
566
567 293.340 MiB 0.000 MiB sampleSourcesInds = {}
568 293.340 MiB 0.000 MiB summarySourcesAvg = {}
569 293.340 MiB 0.000 MiB summarySourcesAvgInds = {}
570 293.340 MiB 0.000 MiB summarySourcesFull = {}
571
572 3342.426 MiB 3049.086 MiB for dirpath, dirnames, filenames in os.walk(measuredFuncFolderArg, topdown=True):
573 # Ignore hidden folders (because of Git etc.)
574 # TODO may not be needed
575 293.340 MiB -3049.086 MiB filenames = [f for f in filenames if not f[0] == '.']
576 293.340 MiB 0.000 MiB dirnames[:] = [d for d in dirnames if not d[0] == '.']
577
578 ##########################
579 # Parse a single folder #
580 ##########################
581
582 # All data from one folder (the function on a particular line)
583 293.340 MiB 0.000 MiB folderData = []
584
585 # Read the parameters given in the CSV file name
586 293.340 MiB 0.000 MiB funcLabelArg = filenameArgs.getFuncLabel()
587 293.340 MiB 0.000 MiB xLabelArg = filenameArgs.getXLabel()
588 293.340 MiB 0.000 MiB otherUserArgs = filenameArgs.getConfigLst()
589
590 # 'Unpack' the config argument into individual values
591 293.340 MiB 0.000 MiB keyLst = filenameArgs.getLstOfParams()
592
593 3382.207 MiB 3088.867 MiB for filename in filenames:
594 # Load the data
595 3382.207 MiB 0.000 MiB p = LabeledCSVParser('{}/{}'.format(dirpath, filename))
596 3382.207 MiB 0.000 MiB p.parse()
597 3382.207 MiB 0.000 MiB data = p.getDicData()
598
599 # Create and write 'samples' if they are specified
600 # by the 'samplesArgs' parameter
We can see that memory consumption rises sharply at lines 572 and 593. Do you know why? I suspect it is os.walk
...
So, have you seen this before? If so, could you tell me how to solve it?
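I tried adding an explicit destructor to the LabeledCSVParser object, but it had very little effect on memory. Also, there seems to be no significant memory consumption inside the parse() function.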
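Instead of relying on a destructor, one alternative I could try is closing the file with a context manager. This is only a minimal sketch and not my actual LabeledCSVParser: the function name parse_labeled_csv is hypothetical, and skipping blank lines and stripping whitespace around values are assumptions of mine.

```python
from collections import defaultdict


def parse_labeled_csv(path):
    """Parse a labeled CSV into {label: [block, ...]}; each block is a list of tuples."""
    dic_data = defaultdict(list)
    data_block = None
    # The with-block closes the file as soon as parsing ends,
    # so nothing depends on __del__ being called.
    with open(path) as parsed_file:
        for row in parsed_file:
            row = row.strip()
            if not row:
                continue  # assumption: blank separator lines carry no data
            if '#' in row:
                # A label line starts a new data block under that label
                label = row.split('#')[1].strip()
                data_block = []
                dic_data[label].append(data_block)
            elif data_block is not None:
                # A data row is stored as a tuple of stripped strings
                data_block.append(tuple(v.strip() for v in row.split(',')))
    return dict(dic_data)
```

Repeated labels end up as separate blocks under the same key, which is the layout described in the parse() docstring.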
So I will try checking the file iteration again.
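To narrow down the file iteration, something like the following could report how much memory each file retains. It is a rough sketch that uses the standard-library tracemalloc module instead of memory_profiler; measure_per_file and its load callback are hypothetical names, and the results are deliberately kept in a list to mimic what my folderData does.

```python
import os
import tracemalloc


def measure_per_file(root, load):
    """Walk root and return (path, bytes retained) for every load(path) call."""
    growth = []
    kept = []  # results stay alive on purpose, like the entries of folderData
    tracemalloc.start()
    try:
        for dirpath, dirnames, filenames in os.walk(root, topdown=True):
            # Prune hidden directories in place so os.walk does not descend into them
            dirnames[:] = [d for d in dirnames if not d.startswith('.')]
            for filename in sorted(filenames):
                if filename.startswith('.'):
                    continue  # skip hidden files
                path = os.path.join(dirpath, filename)
                before, _ = tracemalloc.get_traced_memory()
                kept.append(load(path))
                after, _ = tracemalloc.get_traced_memory()
                growth.append((path, after - before))
    finally:
        tracemalloc.stop()
    return growth
```

If the per-file numbers add up to the total growth, the memory is held by the parsed data itself rather than by os.walk.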
Filename: /home/martin/PycharmProjects/readex-radar/fooVisualizer.py
Line # Mem usage Increment Line Contents
================================================
100 3271.395 MiB 0.000 MiB @profile
101 def parse(self):
102 """
103 Function for parsing a 'labeled' CSV.
104
105 Expects a CSV of the form:
106
107 # Label 1
108 data1, data2
109 data3, data4
110
111 # Label 2
112 data5, data6
113 data7, data8
114 ...
115
116 Labels may repeat; the values obtained
117 are stored in separate lists in the __dicData dictionary,
118 where their shared key is the label.
119 """
120
121 3271.395 MiB 0.000 MiB currentLabel = self.__parsedFile.readline().split('#')[1].strip()
122 3271.395 MiB 0.000 MiB dataBlock = list()
123 3271.395 MiB 0.000 MiB self.__dicData[currentLabel] = list()
124
125 # Store the current dataBlock in __dicData under the key currentLabel
126 # - values are written into this dataBlock later, thanks to
127 # the reference
128 3271.395 MiB 0.000 MiB self.__dicData[currentLabel].append(dataBlock)
129
130 3272.949 MiB 1.555 MiB for row in self.__parsedFile:
131 # Check whether this is a label or a data row
132 3272.949 MiB 0.000 MiB if row.__contains__('#'):
133
134 # Create a new dataBlock
135 3271.395 MiB -1.555 MiB dataBlock = list()
136
137 # Get the label name from the row
138 3271.395 MiB 0.000 MiB tmpLabel = row.split('#')[1].strip()
139
140 # The label becomes the 'current' one - the following
141 # data will be written under it
142 3271.395 MiB 0.000 MiB currentLabel = tmpLabel
143
144 # If the label is not 'registered' yet, add
145 # it to __dicData as a key
146 3271.395 MiB 0.000 MiB if currentLabel not in self.__dicData.keys():
147 3271.395 MiB 0.000 MiB self.__dicData[currentLabel] = list()
148
149 3271.395 MiB 0.000 MiB self.__dicData[currentLabel].append(dataBlock)
150 else:
151 # Add the parsed row to the current
152 # data block as a tuple
153 3272.949 MiB 1.555 MiB dataBlock.append(tuple(row.strip().split(',')))
Filename: /home/martin/PycharmProjects/readex-radar/fooVisualizer.py
Line # Mem usage Increment Line Contents
================================================
558 293.543 MiB 0.000 MiB @profile
559 def getAllDataClassifiedFromFolder(measuredFuncFolderArg,
560 yLabelArg,
561 filenameArgs,
562 slideshowCreator,
563 samplesArgs=None):
564
565 293.543 MiB 0.000 MiB sampleSourcesInds = {}
566 293.543 MiB 0.000 MiB summarySourcesAvg = {}
567 293.543 MiB 0.000 MiB summarySourcesAvgInds = {}
568 293.543 MiB 0.000 MiB summarySourcesFull = {}
569
570 2746.160 MiB 2452.617 MiB for dirpath, dirnames, filenames in os.walk(measuredFuncFolderArg, topdown=True):
571 # Ignore hidden folders (because of Git etc.)
572 # TODO may not be needed
573 #filenames = [f for f in filenames if not f[0] == '.']
574 #dirnames[:] = [d for d in dirnames if not d[0] == '.']
575
576 ##########################
577 # Parse a single folder #
578 ##########################
579
580 # All data from one folder (the function on a particular line)
581 293.543 MiB -2452.617 MiB folderData = []
582
583 # Read the parameters given in the CSV file name
584 293.543 MiB 0.000 MiB funcLabelArg = filenameArgs.getFuncLabel()
585 293.543 MiB 0.000 MiB xLabelArg = filenameArgs.getXLabel()
586 293.543 MiB 0.000 MiB otherUserArgs = filenameArgs.getConfigLst()
587
588 # 'Unpack' the config argument into individual values
589 293.543 MiB 0.000 MiB keyLst = filenameArgs.getLstOfParams()
590
591 293.543 MiB 0.000 MiB print('nacitam data')
592 3271.395 MiB 2977.852 MiB for filename in filenames:
593 # Load the data
594 3271.395 MiB 0.000 MiB p = LabeledCSVParser('{}/{}'.format(dirpath, filename))
595 3271.395 MiB 0.000 MiB p.parse()
596 3271.395 MiB 0.000 MiB data = p.getDicData()
597
598 3271.395 MiB 0.000 MiB print('zapisuji samples')
599 # Create and write 'samples' if they are specified
600 # by the 'samplesArgs' parameter
601 3271.395 MiB 0.000 MiB if samplesArgs:
602 sampleSourcesInds[filename] = {}
603 for sampleArg in samplesArgs:
604 prevNumOfSources = slideshowCreator.getNumOfDataSources()
605 slideshowCreator.createAndAddDataSource(data[sampleArg], 100, True, 0, 2)
606 sampleSourcesInds[filename][sampleArg] = list(range(prevNumOfSources,
607 slideshowCreator.getNumOfDataSources()))
608
609 # Get the parameter names from the file name
610 3271.395 MiB 0.000 MiB args = filename[0:filename.rfind('.')].split('_')
611
612 3271.395 MiB 0.000 MiB print('tvorim slovnik')
613 # Assign the concrete values from the CSV file name
614 # to the given filenameArgs parameters
615 3271.395 MiB 0.000 MiB d = {key: (args[i] if i < len(args) else '') for i, key in enumerate(keyLst)}
616
617 # Add the data loaded from the file to the dictionary
618 3271.395 MiB 0.000 MiB d['Data'] = data
619
620 3271.395 MiB 0.000 MiB print('pridavam slovnik do folderData')
621 3271.395 MiB 0.000 MiB folderData.append(d)
622
623 ###############################################################
624 # Split the data loaded from the folder into groups by the optional #
625 # arguments (preconditioner, Schur complement...) #
626 ###############################################################
627 2742.480 MiB -528.914 MiB print('rozdeluji data do kategorii')
628 # Stored average yLabel values over all
629 # calls of the function
630 2742.480 MiB 0.000 MiB folderDataGroupsAvg = {}
631
632 # Stored yLabel values from all calls of the function
633 #
634 # TODO may need to be written as a source for
635 # the plot of individual solver iterations
636 2742.480 MiB 0.000 MiB folderDataGroupsFull = {}
637
638 2745.746 MiB 3.266 MiB for i, val in enumerate(folderData):
639 # Get the values of the configuration arguments
640 # and store them as a tuple
641 2745.746 MiB 0.000 MiB optArgsTup = tuple([str(val[arg]) for arg in otherUserArgs])
642
643 # If it is not there yet, add the tuple of configuration
644 # parameters as a key of the dictionary of average
645 # consumption values
646 2745.746 MiB 0.000 MiB if optArgsTup not in folderDataGroupsAvg:
647 2743.320 MiB -2.426 MiB folderDataGroupsAvg[optArgsTup] = {}
648 2743.320 MiB 0.000 MiB folderDataGroupsFull[optArgsTup] = {}
649
650 2745.746 MiB 2.426 MiB if folderData[i][funcLabelArg] not in folderDataGroupsAvg[optArgsTup]:
651 2744.539 MiB -1.207 MiB folderDataGroupsAvg[optArgsTup][folderData[i][funcLabelArg]] = []
652 2744.539 MiB 0.000 MiB folderDataGroupsFull[optArgsTup][folderData[i][funcLabelArg]] = []
653
654 # Function for getting the values from the Blade summary
655 # that serve as yLabelArg.
656 # Not written as a lambda because of the physical length of the function's code.
657 2745.746 MiB 1.207 MiB def getYLabelVals(ind):
658 2745.746 MiB 0.000 MiB retLst = []
659 2745.746 MiB 0.000 MiB for subLst in folderData[ind]['Data']['Blade summary']:
660 2745.746 MiB 0.000 MiB for item in subLst:
661 2745.746 MiB 0.000 MiB if item[0] == yLabelArg:
662 2745.746 MiB 0.000 MiB retLst.append(float(item[1]))
663 2745.746 MiB 0.000 MiB return retLst
664
665 # Write the values from all calls of the function for one setting
666 # into folderDataGroupsFull
667 2745.746 MiB 0.000 MiB folderDataGroupsFull[optArgsTup][folderData[i][funcLabelArg]] \
668 2745.746 MiB 0.000 MiB .append((folderData[i][xLabelArg], getYLabelVals(i)))
669
670 # Get the average consumption over all calls of the function for one
671 # setting (Prec, Schur) and one function label
672 # (number of cores...).
673 #
674 # ADD THESE VALUES to folderDataGroupsAvg.
675 2745.746 MiB 0.000 MiB folderDataGroupsAvg[optArgsTup][folderData[i][funcLabelArg]] \
676 2745.746 MiB 0.000 MiB .append((folderData[i][xLabelArg], numpy.mean(getYLabelVals(i))))
677
678 2745.746 MiB 0.000 MiB print('zapidu folderDataGroupsAvg jako zdroj')
679 # Get the data from folderDataGroupsAvg and write them as sources
680 2746.160 MiB 0.414 MiB for key, vals in sorted(folderDataGroupsAvg.items()):
681 2746.160 MiB 0.000 MiB summarySourcesAvg[key] = {}
682
683 # TODO consider whether it would be better to merge summarySourcesAvg and summarySourcesAvgInds
684 # into one dictionary
685 2746.160 MiB 0.000 MiB summarySourcesAvgInds[key] = {}
686
687 2746.160 MiB 0.000 MiB for subKey, val in sorted(vals.items()):
688 # Write the sources for the given configuration into the list - for percentage calculations etc.
689 2746.160 MiB 0.000 MiB summarySourcesAvg[key][subKey] = val
690
691 # Write the data into the sources used for plotting the graphs
692 2746.160 MiB 0.000 MiB summarySourcesAvgInds[key][subKey] = slideshowCreator.getNumOfDataSources()
693 2746.160 MiB 0.000 MiB slideshowCreator.createAndAddDataSourcesTexCode([sorted(val)], 0, False, 0, 1)
694
695 2746.160 MiB 0.000 MiB print('ziskam data z folderDataGroupsFull')
696 # Get the data from folderDataGroupsFull
697 #
698 # TODO may also need to write sources for
699 # the plots of individual iterations
700 2746.160 MiB 0.000 MiB for key, vals in sorted(folderDataGroupsFull.items()):
701 2746.160 MiB 0.000 MiB summarySourcesFull[key] = {}
702
703 2746.160 MiB 0.000 MiB for subKey, val in sorted(vals.items()):
704 # Write the sources for the given configuration into the list - for percentage calculations etc.
705 2746.160 MiB 0.000 MiB summarySourcesFull[key][subKey] = val
706
707 2746.160 MiB 0.000 MiB return summarySourcesAvg, sampleSourcesInds, summarySourcesAvgInds, summarySourcesFull