I want to read in T1 and write it out as T2 (note that both are .csv files). T1 contains duplicate rows, and I don't want to write the duplicates into T2.
T1
+------+------+---------+---------+---------+
| Type | Year | Value 1 | Value 2 | Value 3 |
+------+------+---------+---------+---------+
| a | 8 | x | y | z |
| b | 10 | q | r | s |
+------+------+---------+---------+---------+
T2
+------+------+---------+-------+
| Type | Year | Value # | Value |
+------+------+---------+-------+
| a | 8 | 1 | x |
| a | 8 | 2 | y |
| a | 8 | 3 | z |
| b | 10 | 1 | q |
| ... | ... | ... | ... |
+------+------+---------+-------+
At the moment I have this painfully slow code to filter out the duplicates:
no_dupes = []
for row in reader:
    type = row[0]
    year = row[1]
    index = type, year
    values_list = row[2:]
    if index not in no_dupes:
        for i, j in enumerate(values_list):
            line = [type, year, str(i+1), str(j)]
            writer.writerow(line)  # using csv module
        no_dupes.append(index)
This code becomes unbearably slow once T1 gets large.
Is there a faster way to filter the duplicates out of T1 while writing T2?
Answer 0 (score: 4)
I think you want something like this:
no_dupes = set()
for row in reader:
    type, year = row[0], row[1]
    values_list = row[2:]
    for index, value in enumerate(values_list, start=1):
        line = (type, year, index, value)
        no_dupes.add(line)

for t in no_dupes:
    writer.writerow(t)
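The speed-up comes from the set: membership tests and inserts are O(1), whereas `if index not in no_dupes` on a list rescans the whole list for every row of T1. One caveat with the code above is that iterating a set yields rows in arbitrary order. If T2 should preserve T1's row order, a variant that only keeps the (Type, Year) keys in a set and writes rows as they are read may fit better. This is a minimal sketch under that assumption, not the answer's exact code, using hypothetical file names t1.csv and t2.csv:

import csv

with open('t1.csv', newline='') as f_in, open('t2.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    next(reader, None)  # skip T1's header row, if it has one
    writer.writerow(['Type', 'Year', 'Value #', 'Value'])  # header for T2
    seen = set()  # O(1) membership tests, unlike a list
    for row in reader:
        key = (row[0], row[1])  # (Type, Year) identifies a duplicate
        if key in seen:
            continue
        seen.add(key)
        for i, value in enumerate(row[2:], start=1):
            writer.writerow([row[0], row[1], i, value])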
Answer 1 (score: 0)
If possible, convert the reader into a set and iterate over the set; then duplicates are impossible.
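A minimal sketch of that idea, reusing the reader and writer from the question; each row has to be converted to a tuple first, since lists are not hashable and cannot go into a set:

# Collapse duplicate T1 rows into a set of tuples, then expand each unique
# row into one output line per value, as in T2 (output order is arbitrary).
unique_rows = set(tuple(row) for row in reader)
for type_, year, *values in unique_rows:
    for i, value in enumerate(values, start=1):
        writer.writerow([type_, year, i, value])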