Parsing and restructuring CSV files with Python

Date: 2014-02-10 04:50:18

Tags: python sorting hash hashcode autovivification

Python Gurus,

In the past I used Perl to work through very large text files for data mining. I recently decided to switch because I believe Python makes it easier for me to go back through my code and figure out what is going on. The unfortunate (or perhaps fortunate) thing about Python is that, compared with Perl, I find it hard to store and organize data, since I can't create hashes on the fly through autovivification. I also haven't been able to sum the elements of a dictionary of dictionaries.
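
(As an aside, the closest plain-Python stand-in I've found for autovivification is a nested collections.defaultdict; the snippet below is only an illustrative sketch with made-up values, not code from my project.)

    from collections import defaultdict

    # A nested defaultdict "autovivifies" much like a Perl hash:
    # missing intermediate keys are created automatically on first access.
    def tree():
        return defaultdict(tree)

    data = tree()
    data['1415PA']['0']['BEC'] = 262   # no need to create data['1415PA']['0'] first
    print(data['1415PA']['0']['BEC'])  # prints 262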

Hopefully there is an elegant solution to my problem.

I have hundreds of files, each with hundreds of lines of data (everything fits comfortably in memory). The goal is to combine the files, but subject to certain criteria:

  1. For each level (only one level is shown below), I need to create a row for every defect class found across all the files. Not all files contain the same defect classes.

  2. For each level and defect class, sum all of the GEC and BEC values found across all the files.

  3. The final output should look like this (sample output updated; typos fixed):


    Level, defectClass, BECtotals, GECtotals
    1415PA, 0, 643, 1991
    1415PA, 1, 1994, 6470
    ...... and so on ......

    File one:

    Level,  defectClass,   BEC,   GEC
    1415PA,       0,       262,   663
    1415PA,       1,      1138,  4104
    1415PA,     107,         2,     0
    1415PA,      14,         3,     4
    1415PA,      15,         1,     0
    1415PA,       2,       446,   382
    1415PA,      21,         5,     0
    1415PA,      23,        10,     5
    1415PA,       4,         3,    16
    1415PA,       6,        52,   105

    File two:

    level,  defectClass,   BEC,   GEC
    1415PA,       0,       381,  1328
    1415PA,       1,       856,  2366
    1415PA,     107,         7,    11
    1415PA,      14,         4,     1
    1415PA,       2,       315,   202
    1415PA,      23,         4,     7
    1415PA,       4,         0,     2
    1415PA,       6,        46,    42
    1415PA,       7,         1,     7

    The biggest problem I'm running into is being able to sum the dictionaries. Here is the code I have so far (not working):

    import os
    import sys


    class AutoVivification(dict):
        """Implementation of perl's autovivification feature. Has features from both dicts and lists,
        dynamically generates new subitems as needed, and allows for working (somewhat) as a basic type.
        """
        def __getitem__(self, item):
            if isinstance(item, slice):
                d = AutoVivification()
                items = sorted(self.iteritems(), reverse=True)
                k,v = items.pop(0)
                while 1:
                    if (item.start < k < item.stop):
                        d[k] = v
                    elif k > item.stop:
                        break
                    if item.step:
                        for x in range(item.step):
                            k,v = items.pop(0)
                    else:
                        k,v = items.pop(0)
                return d
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value

        def __add__(self, other):
            """If attempting addition, use our length as the 'value'."""
            return len(self) + other

        def __radd__(self, other):
            """If the other type does not support addition with us, this addition method will be tried."""
            return len(self) + other

        def append(self, item):
            """Add the item to the dict, giving it a higher integer key than any currently in use."""
            largestKey = sorted(self.keys())[-1]
            if isinstance(largestKey, str):
                self.__setitem__(0, item)
            elif isinstance(largestKey, int):
                self.__setitem__(largestKey+1, item)

        def count(self, item):
            """Count the number of keys with the specified item."""
            return sum([1 for x in self.items() if x == item])

        def __eq__(self, other):
            """od.__eq__(y) <==> od==y. Comparison to another AV is order-sensitive
            while comparison to a regular mapping is order-insensitive. """
            if isinstance(other, AutoVivification):
                return len(self)==len(other) and self.items() == other.items()
            return dict.__eq__(self, other)

        def __ne__(self, other):
            """od.__ne__(y) <==> od!=y"""
            return not self == other


    for filename in os.listdir('/Users/aleksarias/Desktop/DefectMatchingDatabase/'):
        if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
            continue
        path = '/Users/aleksarias/Desktop/DefectMatchingDatabase/' + filename

        for filename2 in os.listdir(path):
            if filename2[0] == '.':
                continue
            path2 = path + "/" + filename2
            techData = AutoVivification()

            for file in os.listdir(path2):
                if file[0:13] == 'SummaryRearr_':
                    dataFile = path2 + '/' + file
                    print('Location of file to read: ', dataFile, '\n')
                    fh = open(dataFile, 'r')

                    for lines in fh:
                        if lines[0:5] == 'level':
                            continue
                        lines = lines.strip()
                        elements = lines.split(',')

                        if techData[elements[0]][elements[1]]['BEC']:
                            techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                        else:
                            techData[elements[0]][elements[1]]['BEC'] = elements[2]

                        if techData[elements[0]][elements[1]]['GEC']:
                            techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                        else:
                            techData[elements[0]][elements[1]]['GEC'] = elements[3]

                        print(elements[0], elements[1], techData[elements[0]][elements[1]]['BEC'], techData[elements[0]][elements[1]]['GEC'])

            techSumPath = path + '/Summary_' + filename + '.csv'
            fh2 = open(techSumPath, 'w')
            for key1 in sorted(techData):
                for key2 in sorted(techData[key1]):
                    BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                    GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                    fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
            print('Created file at:', techSumPath)
            input('Go check the file!!!!')
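
    In case it helps to see what I'm after in plain terms, here is a stripped-down sketch (with made-up file names, not my real directory layout) of the accumulation I'm trying to express:

    import csv
    from collections import defaultdict

    # Running totals keyed by (level, defectClass); each value is [BEC sum, GEC sum].
    totals = defaultdict(lambda: [0, 0])

    for name in ('fileone.txt', 'filetwo.txt'):          # made-up file names
        with open(name, newline='') as fh:
            for row in csv.DictReader(fh, skipinitialspace=True):
                key = (row.get('Level') or row.get('level'), row['defectClass'])
                totals[key][0] += int(row['BEC'])
                totals[key][1] += int(row['GEC'])

    for (level, defect), (bec, gec) in sorted(totals.items()):
        print('%s,%s,%s,%s' % (level, defect, bec, gec))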
    

    Thanks for taking a look at this!!!!!
    Aleks

1 Answer:

Answer 0 (score: 3):

I'm going to suggest a different approach: if you're working with tabular data, you should take a look at the pandas library. Your code would become something like

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

dfs = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)  # the rows have stray spaces after the commas
    df = df.rename(columns={"level": "Level"})  # file two's header says "level" where file one says "Level"
    dfs.append(df)

df_comb = pd.concat(dfs)
df_totals = df_comb.groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)

which produces

dsm@notebook:~/coding/pand$ cat combined.csv 
Level,defectClass,BEC,GEC
1415PA,0,643,1991
1415PA,1,1994,6470
1415PA,2,761,584
1415PA,4,3,18
1415PA,6,98,147
1415PA,7,1,7
1415PA,14,7,5
1415PA,15,1,0
1415PA,21,5,0
1415PA,23,14,12
1415PA,107,9,11

Here I've read every file into memory at the same time and combined them into one big DataFrame (like an Excel sheet), but we could just as easily do the groupby operation file by file if we wanted, so that only one file ever needs to be in memory at a time; a sketch of that variant follows.
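
A minimal sketch of that file-at-a-time variant (using the same made-up file names as above) could look like this: each file is collapsed to its own per-(Level, defectClass) totals before the next one is read, and the partial totals are summed once more at the end.

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

partials = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    # Reduce this file to its own (Level, defectClass) totals before moving on,
    # so only one file's raw rows are ever held in memory at once.
    partials.append(df.groupby(["Level", "defectClass"], as_index=False).sum())

# Summing the per-file partial sums gives the same result as one big groupby.
df_totals = pd.concat(partials).groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)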