Parsing and restructuring CSV files with Python

Date: 2014-02-10 04:50:18

Tags: python sorting hash hashcode autovivification

Python Gurus,

In the past I used Perl to work through very large text files for data mining. I recently decided to switch because I believe Python makes it easier for me to go back through my code and figure out what is going on. The unfortunate (or perhaps fortunate) thing about Python is that, compared with Perl, I find it hard to store and organize data, since I can't create hashes on the fly through autovivification. I also haven't been able to sum the elements of a dictionary of dictionaries.
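
(As an aside, the closest plain-Python stand-in I've found for autovivification is a nested collections.defaultdict; the snippet below is only an illustrative sketch with made-up values, not code from my project.)

    from collections import defaultdict

    # A nested defaultdict "autovivifies" much like a Perl hash:
    # missing intermediate keys are created automatically on first access.
    def tree():
        return defaultdict(tree)

    data = tree()
    data['1415PA']['0']['BEC'] = 262   # no need to create data['1415PA']['0'] first
    print(data['1415PA']['0']['BEC'])  # prints 262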

Hopefully there is an elegant solution to my problem.

I have hundreds of files, each with hundreds of lines of data (everything fits comfortably in memory). The goal is to combine the files, but subject to certain criteria:

  1. For each level (only one level is shown below), I need to create a row for every defect class found across all the files. Not all files contain the same defect classes.

  2. For each level and defect class, sum all of the GEC and BEC values found across all the files.

  3. The final output should look like this (sample output updated; typos fixed):


    Level, defectClass, BECtotals, GECtotals
    1415PA, 0, 643, 1991
    1415PA, 1, 1994, 6470
    ...... and so on ......

    File one:

    Level,  defectClass,   BEC,   GEC
    1415PA,       0,       262,   663
    1415PA,       1,      1138,  4104
    1415PA,     107,         2,     0
    1415PA,      14,         3,     4
    1415PA,      15,         1,     0
    1415PA,       2,       446,   382
    1415PA,      21,         5,     0
    1415PA,      23,        10,     5
    1415PA,       4,         3,    16
    1415PA,       6,        52,   105

    File two:

    level,  defectClass,   BEC,   GEC
    1415PA,       0,       381,  1328
    1415PA,       1,       856,  2366
    1415PA,     107,         7,    11
    1415PA,      14,         4,     1
    1415PA,       2,       315,   202
    1415PA,      23,         4,     7
    1415PA,       4,         0,     2
    1415PA,       6,        46,    42
    1415PA,       7,         1,     7

    The biggest problem I'm running into is being able to sum the dictionaries. Here is the code I have so far (not working):

    import os
    import sys


    class AutoVivification(dict):
        """Implementation of perl's autovivification feature. Has features from both dicts and lists,
        dynamically generates new subitems as needed, and allows for working (somewhat) as a basic type.
        """
        def __getitem__(self, item):
            if isinstance(item, slice):
                d = AutoVivification()
                items = sorted(self.iteritems(), reverse=True)
                k,v = items.pop(0)
                while 1:
                    if (item.start < k < item.stop):
                        d[k] = v
                    elif k > item.stop:
                        break
                    if item.step:
                        for x in range(item.step):
                            k,v = items.pop(0)
                    else:
                        k,v = items.pop(0)
                return d
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value

        def __add__(self, other):
            """If attempting addition, use our length as the 'value'."""
            return len(self) + other

        def __radd__(self, other):
            """If the other type does not support addition with us, this addition method will be tried."""
            return len(self) + other

        def append(self, item):
            """Add the item to the dict, giving it a higher integer key than any currently in use."""
            largestKey = sorted(self.keys())[-1]
            if isinstance(largestKey, str):
                self.__setitem__(0, item)
            elif isinstance(largestKey, int):
                self.__setitem__(largestKey+1, item)

        def count(self, item):
            """Count the number of keys with the specified item."""
            return sum([1 for x in self.items() if x == item])

        def __eq__(self, other):
            """od.__eq__(y) <==> od==y. Comparison to another AV is order-sensitive
            while comparison to a regular mapping is order-insensitive. """
            if isinstance(other, AutoVivification):
                return len(self)==len(other) and self.items() == other.items()
            return dict.__eq__(self, other)

        def __ne__(self, other):
            """od.__ne__(y) <==> od!=y"""
            return not self == other


    for filename in os.listdir('/Users/aleksarias/Desktop/DefectMatchingDatabase/'):
        if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
            continue
        path = '/Users/aleksarias/Desktop/DefectMatchingDatabase/' + filename

        for filename2 in os.listdir(path):
            if filename2[0] == '.':
                continue
            path2 = path + "/" + filename2
            techData = AutoVivification()

            for file in os.listdir(path2):
                if file[0:13] == 'SummaryRearr_':
                    dataFile = path2 + '/' + file
                    print('Location of file to read: ', dataFile, '\n')
                    fh = open(dataFile, 'r')

                    for lines in fh:
                        if lines[0:5] == 'level':
                            continue
                        lines = lines.strip()
                        elements = lines.split(',')

                        if techData[elements[0]][elements[1]]['BEC']:
                            techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                        else:
                            techData[elements[0]][elements[1]]['BEC'] = elements[2]

                        if techData[elements[0]][elements[1]]['GEC']:
                            techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                        else:
                            techData[elements[0]][elements[1]]['GEC'] = elements[3]

                        print(elements[0], elements[1], techData[elements[0]][elements[1]]['BEC'], techData[elements[0]][elements[1]]['GEC'])

            techSumPath = path + '/Summary_' + filename + '.csv'
            fh2 = open(techSumPath, 'w')
            for key1 in sorted(techData):
                for key2 in sorted(techData[key1]):
                    BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                    GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                    fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
            print('Created file at:', techSumPath)
            input('Go check the file!!!!')
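
    In case it helps to see what I'm after in plain terms, here is a stripped-down sketch (with made-up file names, not my real directory layout) of the accumulation I'm trying to express:

    import csv
    from collections import defaultdict

    # Running totals keyed by (level, defectClass); each value is [BEC sum, GEC sum].
    totals = defaultdict(lambda: [0, 0])

    for name in ('fileone.txt', 'filetwo.txt'):          # made-up file names
        with open(name, newline='') as fh:
            for row in csv.DictReader(fh, skipinitialspace=True):
                key = (row.get('Level') or row.get('level'), row['defectClass'])
                totals[key][0] += int(row['BEC'])
                totals[key][1] += int(row['GEC'])

    for (level, defect), (bec, gec) in sorted(totals.items()):
        print('%s,%s,%s,%s' % (level, defect, bec, gec))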
    

    Thanks for taking a look at this!!!!!
    Aleks

1 Answer:

Answer 0 (score: 3):

I'm going to suggest a different approach: if you're working with tabular data, you should take a look at the pandas library. Your code would become something like

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

dfs = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)  # the rows have stray spaces after the commas
    df = df.rename(columns={"level": "Level"})  # file two's header says "level" where file one says "Level"
    dfs.append(df)

df_comb = pd.concat(dfs)
df_totals = df_comb.groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)

which produces

dsm@notebook:~/coding/pand$ cat combined.csv 
Level,defectClass,BEC,GEC
1415PA,0,643,1991
1415PA,1,1994,6470
1415PA,2,761,584
1415PA,4,3,18
1415PA,6,98,147
1415PA,7,1,7
1415PA,14,7,5
1415PA,15,1,0
1415PA,21,5,0
1415PA,23,14,12
1415PA,107,9,11

Here I've read every file into memory at the same time and combined them into one big DataFrame (like an Excel sheet), but we could just as easily do the groupby operation file by file if we wanted, so that only one file ever needs to be in memory at a time; a sketch of that variant follows.
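
A minimal sketch of that file-at-a-time variant (using the same made-up file names as above) could look like this: each file is collapsed to its own per-(Level, defectClass) totals before the next one is read, and the partial totals are summed once more at the end.

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

partials = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    # Reduce this file to its own (Level, defectClass) totals before moving on,
    # so only one file's raw rows are ever held in memory at once.
    partials.append(df.groupby(["Level", "defectClass"], as_index=False).sum())

# Summing the per-file partial sums gives the same result as one big groupby.
df_totals = pd.concat(partials).groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)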