Question

I have a CSV file with 120,000 rows in the following format:

ID Duplicate
1 65
2 67
4 12
4 53
4 101
12 4
12 53
101 ...

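This list basically gives a number of user IDs and, for each one, the users that are duplicates of that user. Because of how the list is built, I can't filter it in Excel right now, so I am trying to convert it into this result: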

[1, 65]
[2, 67]
[4, 12, 53, 101]

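After that I can write a new CSV in which list[0] is dropped from each element, so that I keep one user per "block of duplicate users". In Excel I would then delete all the remaining user IDs.

However, I have run into a problem so far: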

import csv

with open("contacts.csv", "rt") as f:
    reader = csv.reader(f, delimiter="\t")

    contacts = []
    for row in reader:
        if row[0] not in contacts:
            contacts.append(row[0])
        if row[1] not in contacts:
            position = contacts.index(row[0])
            contacts[position].append(row[1])

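Of course I get the error "AttributeError: 'str' object has no attribute 'append'", because contacts[position] is a string. But how can I change the code so that I get one list for each block of duplicate contacts?

Thanks!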

Answer 1

There is almost a one-liner for this in standard Python:

import csv
from itertools import groupby

with open("contacts.csv", "rt") as f:
    reader = csv.reader(f, delimiter="\t")
    contacts = [[k] + [r[1] for r in g] for k, g in groupby(reader, key=lambda row: row[0])]

I also like the pandas solution, but it means learning a new API.
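
One caveat about the one-liner above: itertools.groupby only merges adjacent rows that share a key, so it relies on the file already being sorted by the ID column. A minimal sketch of sorting first, assuming made-up in-memory rows (the `rows` list and the `group_rows` helper are illustrative names, not part of the original answer):

```python
from itertools import groupby

# Hypothetical sample rows, deliberately out of order.
rows = [["4", "12"], ["1", "65"], ["4", "53"], ["2", "67"], ["4", "101"]]

def group_rows(rows):
    """Group duplicates under each ID; sort first so equal IDs are adjacent."""
    ordered = sorted(rows, key=lambda row: int(row[0]))
    return [[k] + [r[1] for r in g]
            for k, g in groupby(ordered, key=lambda row: row[0])]

contacts = group_rows(rows)
print(contacts)
```

The sort is stable, so the duplicates within each block keep their original order.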

Answer 2

This will work even if your CSV file isn't sorted and even if some entries are missing:

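import csv

with open('contacts.csv') as infile:
    data = {}  # maps each ID to the shared set (block) it belongs to
    # the question's file is tab-delimited
    for i, dup in csv.reader(infile, delimiter="\t"):
        if i not in data:
            if dup in data:
                # dup already has a block: add i to it and share the set
                data[dup].add(i)
                data[i] = data[dup]
                continue

            # neither ID seen yet: start a new block shared by both
            data[i] = set((i, dup))
            data[dup] = data[i]
            continue

        data[i].add(dup)
        # make sure dup can be looked up as well
        data.setdefault(dup, data[i])

for _, dups in data.items():
    print(sorted(dups))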

Update: if you want to avoid printing the same set of duplicates more than once:

for k, dups in list(data.items()):
    if k not in data:
        continue  # this block was already printed
    print(sorted(dups))
    for d in dups:
        data.pop(d, None)  # tolerate IDs that never became keys
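
If duplicates can chain (A duplicates B, B duplicates C), one way to be sure they all end up in a single block is a small union-find. The following is only a sketch under assumptions: it swaps in a different technique from the code above, the helper names `find` and `duplicate_blocks` are invented here, and the sample data is the complete rows from the question (the truncated "101 ..." row is left out). It also produces the list of IDs to delete, keeping the first user of each block:

```python
import csv
import io

def find(parent, x):
    """Return the representative of x's block, shortening the path as we go."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def duplicate_blocks(pairs):
    """Merge (id, duplicate) pairs into blocks, following chains of duplicates."""
    parent = {}
    for a, b in pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    blocks = {}
    for x in list(parent):
        blocks.setdefault(find(parent, x), []).append(x)
    # Sort IDs numerically inside each block, then blocks by their first ID.
    ordered = [sorted(block, key=int) for block in blocks.values()]
    return sorted(ordered, key=lambda block: int(block[0]))

# Sample rows from the question, tab-delimited, without a header line.
sample = "1\t65\n2\t67\n4\t12\n4\t53\n4\t101\n12\t4\n12\t53\n"
pairs = list(csv.reader(io.StringIO(sample), delimiter="\t"))

blocks = duplicate_blocks(pairs)
# Keep the first user of every block; everything else is to be deleted.
to_delete = [uid for block in blocks for uid in block[1:]]

out = io.StringIO()  # stand-in for a real output file
csv.writer(out).writerows([uid] for uid in to_delete)
print(blocks)
print(to_delete)
```

For the real file, replace the io.StringIO objects with open("contacts.csv") and an output file opened with newline="", and skip the header row before building the pairs.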
