Iterate over a CSV and change the index based on row values

Time: 2018-01-17 19:25:49

Tags: python python-3.x csv

I have a CSV with the following contents:

ID    Name    Series    Value
250   A       3         20
250   A       3         40
250   A       3         60
251   B       4         16
251   B       4         18
251   B       4         24
251   B       4         42

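The Series column indicates how many rows belong together. Take the first row (not the header row), where Series = 3: I need to collect the number of rows specified by Series, including the current row, so that the Value entries end up grouped like this: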

[(20, 40, 60), (16, 18, 24, 42)]

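Basically, I move down the CSV in order, and Series tells me how many rows to collect next (including the current one). Taking the first row again, the value is 3, so my group must total 3 rows starting from the current row.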

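I have already read the CSV and converted it from a Reader to a list, but I can't come up with a solution that actively changes the iteration index over the rows based on the value found in Series. If I try: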

for row in rows...

I end up iterating over every row, so I would have to modify rows itself, and changing a list while iterating over it is a bad idea. If I try:

for x in range(1, len(rows)...

I can't devise a way to change the current position of x.

3 Answers:

Answer 0: (score: 3)

If you can't use pandas, just use collections.defaultdict with the typical grouping idiom:

import csv
import collections

with open("path/to/file.csv") as f:
    reader = csv.DictReader(f)
    grouped = collections.defaultdict(list)
    for row in reader:
        grouped[row['Series']].append(int(row['Value']))

This gives you a convenient dictionary mapping each Series to its values:

In [26]: grouped
Out[26]: defaultdict(list, {'3': [20, 40, 60], '4': [16, 18, 24, 42]})

If you must have a list of tuples:

In [28]: list(map(tuple, grouped.values()))
Out[28]: [(20, 40, 60), (16, 18, 24, 42)]

If you were to use a pandas.DataFrame, I would use:

In [35]: [tuple(g.Value) for _,g in df.groupby('Series')]
Out[35]: [(20, 40, 60), (16, 18, 24, 42)]

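Edit after comments

So, now that you have elaborated on your question, there are several approaches. Here is an ugly one, which uses itertools.islice to advance the iterator: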

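import csv
import io
from itertools import islice

# csvstring is assumed to hold the CSV text from the question
with io.StringIO(csvstring) as f:
    reader = csv.DictReader(f)
    grouped = []
    for row in reader:
        n = int(row['Series']) - 1  # rows still to collect after this one
        val = int(row['Value'])
        # islice pulls the next n rows from the same reader, advancing it
        next_vals = (int(r['Value']) for r in islice(reader, n))
        grouped.append((val,) + tuple(next_vals))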

You could also use itertools.groupby:

import itertools
import operator
import csv

with open('path/to/file.csv') as f:
    reader = csv.DictReader(f)
    grouped = itertools.groupby(reader, operator.itemgetter('Series'))
    result = []
    for _, g in grouped:
        result.append(tuple(int(r['Value']) for r in g))

Result:

In [48]: result
Out[48]: [(20, 40, 60), (16, 18, 24, 42)]
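
The defaultdict and groupby approaches behave differently when a Series value comes back later in the file. The following is a minimal sketch of that difference, assuming comma-separated input; the `sample` string (with an invented third block) and the variable names are made up for illustration. `collections.defaultdict` puts every row with the same key into one bucket, while `itertools.groupby` only groups consecutive rows.

```python
import collections
import csv
import io
import itertools
import operator

# Hypothetical sample: Series 3 appears again after the Series 4 block.
sample = """ID,Name,Series,Value
250,A,3,20
250,A,3,40
250,A,3,60
251,B,4,16
251,B,4,18
251,B,4,24
251,B,4,42
252,C,3,1
252,C,3,2
252,C,3,3
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# defaultdict: one bucket per distinct Series value, regardless of position.
merged = collections.defaultdict(list)
for row in rows:
    merged[row['Series']].append(int(row['Value']))
by_key = [tuple(values) for values in merged.values()]

# groupby: a new group starts whenever the key changes between adjacent rows.
by_run = [
    tuple(int(r['Value']) for r in group)
    for _, group in itertools.groupby(rows, operator.itemgetter('Series'))
]

print(by_key)  # [(20, 40, 60, 1, 2, 3), (16, 18, 24, 42)]
print(by_run)  # [(20, 40, 60), (16, 18, 24, 42), (1, 2, 3)]
```

If the same Series value can recur in your data, the groupby form is probably closer to what the question asks for.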

Note, for illustration purposes only: you don't need itertools to do this. You can use a plain for loop as follows:

import csv

with open('path/to/file.csv') as f:
    reader = csv.DictReader(f)
    grouped = []
    for row in reader:
        n = int(row['Series']) - 1
        val = int(row['Value'])  # convert to int, like the rows appended below
        sub = [val]
        for _ in range(n):
            sub.append(int(next(reader)['Value'])) #advance the iterator using next
        grouped.append(tuple(sub))
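
For completeness, the index-based loop the question was attempting can be written with `while` instead of `for`, since a `while` loop lets you advance the index by any amount. This is only a sketch, assuming the rows are already in a list and the input is comma-separated; `group_by_count` is a hypothetical helper name, not part of the csv module.

```python
import csv
import io


def group_by_count(rows):
    """Collect Value tuples, using the Series of each group's first row as its size."""
    grouped = []
    i = 0
    while i < len(rows):
        n = int(rows[i]['Series'])
        if n < 1:
            # Guard against a zero or negative count, which would loop forever.
            raise ValueError(f"Series must be at least 1, got {n} at row {i}")
        chunk = rows[i:i + n]  # slicing never raises; a short tail yields fewer rows
        grouped.append(tuple(int(r['Value']) for r in chunk))
        i += n  # jump past the rows just consumed
    return grouped


# The question's data, assumed here to be comma-separated.
sample = """ID,Name,Series,Value
250,A,3,20
250,A,3,40
250,A,3,60
251,B,4,16
251,B,4,18
251,B,4,24
251,B,4,42
"""

rows = list(csv.DictReader(io.StringIO(sample)))
result = group_by_count(rows)
print(result)  # [(20, 40, 60), (16, 18, 24, 42)]
```

Unlike the `next(reader)` version, slicing does not raise StopIteration if the file ends in the middle of a group. The last tuple simply comes out shorter.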

Answer 1: (score: 2)

How about using pandas?

import pandas as pd

df = pd.read_csv('test.csv')
unique = tuple(df['Series'].unique())
data = [tuple(df[df.Series == i].Value) for i in unique]
print(data)

Output

[(20, 40, 60), (16, 18, 24, 42)]

Answer 2: (score: 1)

Repeated series values don't play well with dicts, so just use lists:

Adding a repeated series to the data...

import csv

t = """ID    Name    Series    Value
250   A       3         20
250   A       3         40
250   A       3         60
251   B       4         16
251   B       4         18
251   B       4         24
251   B       4         42
250   A       3        140
250   A       3        160"""


results = list()
tempList = list()
lastKey = None

reader = csv.DictReader(t.splitlines(), skipinitialspace=True, delimiter=' '  )
for row in reader:
    actKey = row["Series"]
    actVal = row["Value"]

    if not lastKey or lastKey != actKey: # new series starts here
        lastKey = actKey
        if tempList:                     # avoids result starting with []
            results.append(tempList)
        tempList = [actVal]              # this value goes into the new list
        continue

    tempList.append(actVal)              # same key as last one, simply add value 


if tempList:
    results.append(tempList)             # if not empty, add last ones to result 

print(results)

Output:

[['20', '40', '60'], ['16', '18', '24', '42'], ['140', '160']]
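
One caveat: any approach that starts a new group when the Series value changes cannot separate two adjacent groups that happen to share the same Series value. The sketch below illustrates this on invented comma-separated data (the `sample` string and variable names are assumptions for illustration). It compares a key-change grouping with a count-based one that takes Series as the number of rows to collect, as the question describes.

```python
import csv
import io
from itertools import groupby, islice

# Hypothetical data: two back-to-back groups that both have Series = 2.
sample = """ID,Name,Series,Value
250,A,2,20
250,A,2,40
251,B,2,16
251,B,2,18
"""

# Key-change grouping: the Series value never changes, so everything is one group.
rows = list(csv.DictReader(io.StringIO(sample)))
by_change = [
    tuple(int(r['Value']) for r in group)
    for _, group in groupby(rows, key=lambda r: r['Series'])
]

# Count-based grouping: read Series as "how many rows belong to this group".
reader = csv.DictReader(io.StringIO(sample))
by_count = []
for row in reader:
    n = int(row['Series'])
    rest = islice(reader, n - 1)  # the remaining n - 1 rows of this group
    by_count.append((int(row['Value']),) + tuple(int(r['Value']) for r in rest))

print(by_change)  # [(20, 40, 16, 18)]
print(by_count)   # [(20, 40), (16, 18)]
```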