Python - Iterate over a text file and create a dictionary of dictionaries

Time: 2015-08-22 12:49:18

Tags: python dictionary

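I have a text file with 3 space-separated columns. I am trying to find, for each value in column A, how many column B items have passed. A column B value counts as passed only if column C contains no status other than Passed for it. So in the sample data below, PRO-16 is considered failed while PRO-18 is passed, and so on. Code-wise, I tried converting the data to a dict and iterating over the inner dictionary to check whether column C holds any status other than Passed for a column B value, but with no luck. Any help is greatly appreciated!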

Edit: this is the code I am using to build the dict, but it only reads the first line of the text file:

myFile = pd.read_csv('SIT Req.txt')

dataDict = {}
for line in myFile:
    words = line.strip().split()
    fa = words[0]
    req = words[1]
    state = words[2]
    innerDict = dataDict.setdefault(fa, {})
    innerDict[req] = state

FT  PRO-16  Passed
FT  PRO-16  Failed
FT  PRO-18  Passed
FT  PRO-18  Passed
FT  PRO-19  Passed
FT  PRO-20  Failed
FT  PRO-21  No Run
FT  GR-01   Passed
FT  GR-02   Passed
FT  GR-02   Passed
FT  GR-02   Passed
FT  GR-03   Passed
LE  GR-19   Passed
LE  GR-19   Passed
LE  GR-20   Passed
LE  GR-21   Failed
LE  GR-22   Passed
LE  DEL-14  Passed
LE  DEL-14  Passed
LE  DEL-14  Passed
LE  DEL-15  Failed
LE  PRO-43  Failed
LE  PRO-45  Passed
LE  PRO-51  Passed
CD  GR-07   Passed
CD  GR-07   Failed
CD  GR-09   Passed
CD  GR-07   Passed
CD  GR-07   Passed
CD  GR-13   No Run
CD  GR-13   No Run
CD  GR-13   No Run
CD  GR-13   Failed

3 Answers:

Answer 0: (score: 1)

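You can use collections.defaultdict to create a dictionary keyed by column A, where each value is a defaultdict(list). The nested defaultdict(list) uses column B as its key and holds the list of column C values.

The following code builds such a dictionary, then uses it to count, for each column A value, the column B items that passed.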

from pandas import read_csv
from collections import defaultdict

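data = defaultdict(lambda: defaultdict(list))

# header=None because the sample data has no header row;
# without it the first data line is consumed as column names
df = read_csv('datafile', sep='\t', header=None)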
for a, b, c in df.values:
    data[a][b].append(c)

#from pprint import pprint
#pprint(data.items())

# output the total number of passes for each "A" in which all runs of "B" passed.
result_counts = {a: sum(1 for b in data[a] if all(c=='Passed' for c in data[a][b])) for a in data}
print('Counts: {}'.format(result_counts))

# output for each "A" a list of all passed "B"s.
result_passed = {a: list(b for b in data[a] if all(c=='Passed' for c in data[a][b])) for a in data}
print('Passed: {}'.format(result_passed))

Output

Counts: {'LE': 6, 'FT': 5, 'CD': 1}
Passed: {'LE': ['DEL-14', 'PRO-45', 'PRO-51', 'GR-19', 'GR-22', 'GR-20'], 'FT': ['PRO-19', 'PRO-18', 'GR-01', 'GR-03', 'GR-02'], 'CD': ['GR-09']}

Update

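Regarding the trouble you had iterating over the data frame, I see two problems. First, the default field separator for read_csv is a comma, and your data appears to be tab-delimited. Second, you cannot iterate over the rows of a data frame directly: a plain for loop over a DataFrame yields its column labels, not its rows, which is why your loop appeared to read only the first line (read_csv had taken that line as the header). Try one of the following instead (I give several because they have different performance characteristics):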

df = pd.read_csv('SIT Req.txt', sep='\t', header=None)    # note use of sep; header=None since the file has no header row

for a, b, c in df.values:
    ...
# or
for i, a, b, c in df.itertuples():
    ...
# or
for i, row in df.iterrows():
    a, b, c = row
    ...

Update 2

Here is a long-form version of the dictionary comprehension that selects the column B items for which all tests passed:

result_passed = {}
for a in data:
    result_passed[a] = []
    for b in data[a]:
        passed = True
        for c in data[a][b]:
            if c != 'Passed':
                passed = False
                break
        if passed:
            result_passed[a].append(b)

Looking at the contents and structure of the data dictionary should give you a better idea of how this works:

>>> from pprint import pprint
>>> pprint(data.items())
[('LE',
  defaultdict(<type 'list'>, {'DEL-15': ['Failed'], 'DEL-14': ['Passed', 'Passed', 'Passed'], 'PRO-43': ['Failed'], 'PRO-45': ['Passed'], 'PRO-51': ['Passed'], 'GR-19': ['Passed', 'Passed'], 'GR-22': ['Passed'], 'GR-21': ['Failed'], 'GR-20': ['Passed']})),
 ('FT',
  defaultdict(<type 'list'>, {'PRO-19': ['Passed'], 'PRO-20': ['Failed'], 'PRO-21': ['No Run'], 'PRO-16': ['Failed'], 'PRO-18': ['Passed', 'Passed'], 'GR-01': ['Passed'], 'GR-03': ['Passed'], 'GR-02': ['Passed', 'Passed', 'Passed']})),
 ('CD',
  defaultdict(<type 'list'>, {'GR-07': ['Passed', 'Failed', 'Passed', 'Passed'], 'GR-09': ['Passed'], 'GR-13': ['No Run', 'No Run', 'No Run', 'Failed']}))]
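As a rough sketch only, the same nested-dictionary idea can be built without pandas by splitting each line on whitespace. The helper names load_statuses and count_passed and the inline sample list are made up for illustration; the sketch assumes every non-blank line has at least three whitespace-separated fields.

```python
from collections import defaultdict

def load_statuses(lines):
    """Build {column A: {column B: [column C statuses]}} from whitespace-separated lines."""
    data = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # maxsplit=2 keeps a multi-word status such as "No Run" in one piece
        a, b, c = line.split(None, 2)
        data[a][b].append(c.strip())
    return data

def count_passed(data):
    """Count, per column A value, the column B items whose every status is 'Passed'."""
    return {a: sum(1 for statuses in items.values()
                   if all(s == 'Passed' for s in statuses))
            for a, items in data.items()}

# Hypothetical sample; in practice pass an open file object instead of a list.
sample = [
    "FT  PRO-16  Passed",
    "FT  PRO-16  Failed",
    "FT  PRO-18  Passed",
    "FT  PRO-21  No Run",
    "LE  GR-19   Passed",
]
data = load_statuses(sample)
counts = count_passed(data)
print(counts)
```

Because a file object iterates line by line, load_statuses(open(path)) should work the same way as the list above.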

Answer 1: (score: 0)

You can use a defaultdict:

from collections import defaultdict

# any item never marked False defaults to True when looked up
d = defaultdict(lambda: defaultdict(lambda: True))
with open('SIT Req.txt') as f:
    for line in f:
        words = line.split()
        if words[2] != 'Passed':
            d[words[0]][words[1]] = False

In [49]: d['FT']['PRO-18']
Out[49]: True

In [50]: d['FT']['PRO-16']
Out[50]: False
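One thing to keep in mind with the default-to-True approach is that items which only ever passed are never stored, so they cannot be listed or counted, and a misspelled key also looks up as True. A possible variation, sketched here under the assumption that every non-blank line has three whitespace-separated fields, records every item it sees. The name build_pass_map and the sample list are hypothetical.

```python
from collections import defaultdict

def build_pass_map(lines):
    """Return {column A: {column B: True/False}}, recording every item seen."""
    result = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        a, b, status = line.split(None, 2)
        # Start from True the first time an item appears, then AND in each status.
        result[a][b] = result[a].get(b, True) and status.strip() == 'Passed'
    return result

# Hypothetical sample rows taken from the question's data.
sample = [
    "FT  PRO-16  Passed",
    "FT  PRO-16  Failed",
    "FT  PRO-18  Passed",
    "FT  PRO-18  Passed",
    "CD  GR-13   No Run",
]
pass_map = build_pass_map(sample)

# True counts as 1 when summed, so this gives the number of passed items per column A value.
passed_counts = {a: sum(items.values()) for a, items in pass_map.items()}
print(passed_counts)
```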

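Answer 2: (score: 0)

Judging by your data, the column B items all appear to fall within the span of a single column A entry, i.e. column A appears to be contiguous. If that is the case, and you are processing a large file, the following approach could be used: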

import csv, itertools

with open('input.csv', 'r') as f_input:
    csv_input = csv.reader(f_input, delimiter=" ", skipinitialspace=True)

    for k1, g1 in itertools.groupby(csv_input, key=lambda x: x[0]):
        group = sorted(g1, key=lambda x: x[1])
        for k2, g2 in itertools.groupby(group, key=lambda x: x[1]):
            if all(cols[2] == 'Passed' for cols in g2):
                print("%s %s Passed" % (k1, k2))
            else:
                print("%s %s Failed" % (k1, k2))

For the data you provided, this displays the following results:

FT GR-01 Passed
FT GR-02 Passed
FT GR-03 Passed
FT PRO-16 Failed
FT PRO-18 Passed
FT PRO-19 Passed
FT PRO-20 Failed
FT PRO-21 Failed
LE DEL-14 Passed
LE DEL-15 Failed
LE GR-19 Passed
LE GR-20 Passed
LE GR-21 Failed
LE GR-22 Passed
LE PRO-43 Failed
LE PRO-45 Passed
LE PRO-51 Passed
CD GR-07 Failed
CD GR-09 Passed
CD GR-13 Failed
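The groupby approach relies on column A being contiguous, because itertools.groupby only groups adjacent rows with equal keys. If that assumption might not hold, one option is to sort on both columns first, at the cost of holding all rows in memory and losing the original column A order. The sketch below illustrates that idea; the function name summarize and the sample rows are hypothetical.

```python
from itertools import groupby

def summarize(rows):
    """Yield (A, B, 'Passed' or 'Failed') for rows given as (A, B, status) tuples."""
    def key(row):
        return (row[0], row[1])
    # Sorting on (A, B) makes equal keys adjacent, which groupby requires.
    ordered = sorted(rows, key=key)
    for (a, b), group in groupby(ordered, key=key):
        verdict = 'Passed' if all(r[2] == 'Passed' for r in group) else 'Failed'
        yield a, b, verdict

# Hypothetical rows in which column A is deliberately not contiguous.
rows = [
    ("CD", "GR-07", "Passed"),
    ("FT", "PRO-16", "Passed"),
    ("CD", "GR-07", "Failed"),
    ("FT", "PRO-16", "Passed"),
]
results = list(summarize(rows))
for a, b, verdict in results:
    print("%s %s %s" % (a, b, verdict))
```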