I have a CSV with the following contents:
ID Name Series Value
250 A 3 20
250 A 3 40
250 A 3 60
251 B 4 16
251 B 4 18
251 B 4 24
251 B 4 42
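The Series column indicates how many rows belong together, so for the first row (not the header row) Series = 3. I therefore need to collect the number of rows given by Series, counting the current row, so that the Value fields end up grouped like this: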
[(20, 40, 60), (16, 18, 24, 42)]
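Basically, I move down the CSV in order, and Series tells me how many rows to collect (including the current row). Taking the first row again, the value is 3, so my group must total 3 rows starting from the current row.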
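I have read in the CSV and converted it from a Reader to a List, but I cannot come up with a solution that actively changes the iteration index over the rows based on the value found in Series. If I try: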
for row in rows...
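I end up iterating over every row; to skip rows I would have to modify rows, and changing a list while iterating over it is a bad idea. If I try: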
for x in range(1, len(rows)...
I cannot devise a way to change the current position of x.
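One way to get the movable index described above is a while loop that advances the index by hand. This is only a sketch under assumptions: rows is a list of dicts such as list(csv.DictReader(...)) would produce, the file is space-delimited as displayed in the question, and the name group_by_count is made up for illustration.

```python
import csv
import io

# Sample data copied from the question; assumed to be space-delimited.
CSV_TEXT = """ID Name Series Value
250 A 3 20
250 A 3 40
250 A 3 60
251 B 4 16
251 B 4 18
251 B 4 24
251 B 4 42
"""


def group_by_count(rows):
    """Group Value fields, using the Series field of each group's first row as its size."""
    groups = []
    i = 0
    while i < len(rows):
        n = int(rows[i]["Series"])  # number of rows in this group
        if n < 1:
            raise ValueError(f"Series must be >= 1, got {n}")
        chunk = rows[i:i + n]  # slicing stops quietly at the end of the list
        groups.append(tuple(int(r["Value"]) for r in chunk))
        i += n  # jump past the rows just collected
    return groups


rows = list(csv.DictReader(io.StringIO(CSV_TEXT), delimiter=" "))
print(group_by_count(rows))  # [(20, 40, 60), (16, 18, 24, 42)]
```

Because the index is an ordinary variable rather than a for-loop target, it can be moved forward by any amount without modifying rows.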
Answer 0: (score: 3)
If you can't use pandas, just use collections.defaultdict with the typical grouping idiom:
import csv
import collections
with open("path/to/file.csv") as f:
    reader = csv.DictReader(f)
    grouped = collections.defaultdict(list)
    for row in reader:
        grouped[row['Series']].append(int(row['Value']))
This gives you a convenient dictionary mapping each Series to its values:
In [26]: grouped
Out[26]: defaultdict(list, {'3': [20, 40, 60], '4': [16, 18, 24, 42]})
If you must have a list of tuples:
In [28]: list(map(tuple, grouped.values()))
Out[28]: [(20, 40, 60), (16, 18, 24, 42)]
If you do want to use a pandas.DataFrame, I would use:
In [35]: [tuple(g.Value) for _,g in df.groupby('Series')]
Out[35]: [(20, 40, 60), (16, 18, 24, 42)]
So, now that you have elaborated on your question, there are several approaches. Here is an ugly one, using itertools.islice to advance the iterator:
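import csv
import io
from itertools import islice
# csvstring is assumed to hold the CSV text shown in the question
with io.StringIO(csvstring) as f:
    reader = csv.DictReader(f)
    grouped = []
    for row in reader:
        n = int(row['Series']) - 1  # rows still to collect after this one
        val = int(row['Value'])  # convert so every element of the tuple is an int
        next_vals = (int(r['Value']) for r in islice(reader, n))
        grouped.append((val,) + tuple(next_vals))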
You can also use itertools.groupby:
import itertools
import operator
import csv
with open('path/to/file.csv') as f:
    reader = csv.DictReader(f)
    grouped = itertools.groupby(reader, operator.itemgetter('Series'))
    result = []
    for _, g in grouped:
        result.append(tuple(int(r['Value']) for r in g))
Result:
In [48]: result
Out[48]: [(20, 40, 60), (16, 18, 24, 42)]
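One caveat worth checking against your data: itertools.groupby starts a new group only when the key changes, so it relies on neighbouring groups having different Series values. The sketch below uses made-up rows (not taken from the question) in which two groups of size 2 follow each other directly, and shows that they come out merged.

```python
import itertools
import operator

# Hypothetical rows: two adjacent groups that both have Series == '2'.
rows = [
    {"Series": "2", "Value": "1"},
    {"Series": "2", "Value": "2"},
    {"Series": "2", "Value": "3"},
    {"Series": "2", "Value": "4"},
]

merged = [
    tuple(int(r["Value"]) for r in g)
    for _, g in itertools.groupby(rows, operator.itemgetter("Series"))
]
print(merged)  # [(1, 2, 3, 4)], not [(1, 2), (3, 4)]
```

If that situation can occur in your file, a count-based approach that reads Series as a number of rows is probably the safer choice.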
Note, purely for illustration, that you do not need itertools to do this; you can do it with plain for loops as follows:
import csv
with open('path/to/file.csv') as f:
    reader = csv.DictReader(f)
    grouped = []
    for row in reader:
        n = int(row['Series']) - 1  # rows still to collect after this one
        val = int(row['Value'])  # convert so every element of the tuple is an int
        sub = [val]
        for _ in range(n):
            sub.append(int(next(reader)['Value']))  # advance the iterator using next
        grouped.append(tuple(sub))
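If the final group is shorter than its Series value claims, a bare next(reader) raises StopIteration. A possible variation, sketched here with made-up, comma-delimited input and a hypothetical function name collect_groups, passes a default to next so that a truncated last group is kept rather than causing an error.

```python
import csv
import io

# Made-up input: the last group claims 3 rows but only 2 are present.
CSV_TEXT = """ID,Name,Series,Value
250,A,2,20
250,A,2,40
251,B,3,16
251,B,3,18
"""


def collect_groups(reader):
    """Collect Value groups, tolerating a short final group."""
    grouped = []
    for row in reader:
        sub = [int(row["Value"])]
        for _ in range(int(row["Series"]) - 1):
            nxt = next(reader, None)  # None instead of StopIteration at end of input
            if nxt is None:
                break
            sub.append(int(nxt["Value"]))
        grouped.append(tuple(sub))
    return grouped


print(collect_groups(csv.DictReader(io.StringIO(CSV_TEXT))))
# [(20, 40), (16, 18)]
```

Whether a short group should be kept or reported as an error depends on your data; this sketch simply keeps it.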
Answer 1: (score: 2)
How about using pandas?
import pandas as pd
df = pd.read_csv('test.csv')
unique = tuple(df['Series'].unique())
data = [tuple(df[df.Series == i].Value) for i in unique]
print(data)
Output:
[(20, 40, 60), (16, 18, 24, 42)]
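Answer 2: (score: 1)
Repeated Series values are a problem for dicts, because non-adjacent groups with the same key get merged, so just use lists.
I added a repeated series to the data to show this: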
import csv
t = """ID Name Series Value
250 A 3 20
250 A 3 40
250 A 3 60
251 B 4 16
251 B 4 18
251 B 4 24
251 B 4 42
250 A 3 140
250 A 3 160"""
results = list()
tempList = list()
lastKey = None
reader = csv.DictReader(t.splitlines(), skipinitialspace=True, delimiter=' ')
for row in reader:
    actKey = row["Series"]
    actVal = row["Value"]
    if not lastKey or lastKey != actKey:  # new series starts here
        lastKey = actKey
        if tempList:  # avoids result starting with []
            results.append(tempList)
        tempList = [actVal]  # this value goes into the new list
        continue
    tempList.append(actVal)  # same key as last one, simply add value
if tempList:
    results.append(tempList)  # if not empty, add last ones to result
print(results)
Output:
[['20', '40', '60'], ['16', '18', '24', '42'], ['140', '160']]
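The values above are still strings held in lists. If the list-of-int-tuples shape from the question is wanted, one more pass can convert them; this is a small sketch that starts from the printed output, and the name as_tuples is made up for illustration.

```python
# Output of the loop above, copied verbatim.
results = [['20', '40', '60'], ['16', '18', '24', '42'], ['140', '160']]

# Convert every value to int and every group to a tuple.
as_tuples = [tuple(int(v) for v in group) for group in results]
print(as_tuples)  # [(20, 40, 60), (16, 18, 24, 42), (140, 160)]
```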