Question

I have a dataset (CSV) of people and their voting places, such as:

**Person | Voting Place | VP address | ....**

John Doe | Zoo | 123 fake street | ....

Jane Doe | Zoo | 123 fake street | ....

Joey Ramone | Park | 814 Real Street | ...

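I'd like to normalize this data so that the locations are pulled out into a separate list, deduplicated, and assigned an arbitrary ID #. The people would then be stored in a separate file that references the voting location ID # instead of the actual information.

I understand how to use a Python set to deduplicate a combination of columns and break them out into their own file. What I don't understand is how to get or assign an ID for each element in the set that I can refer to later. Can this be done in a single iteration over the CSV, something like this pseudocode: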

for row in file:
    person = [row[0], row[1]]
    voting_location = [row[2],row[3]]
    if voting_location not in unique_set:
        add to set
        get ID of element in set
        write location line in location file
    else: # location already in list so its a duplicate
        get id of location already in list
    append id to person_list
    write person line in person file

Is there a way to do this in pure Python / csv, or do I need to spin up a proper relational database to get the job done?

Answer 1

You can use a dictionary. Use the voting place as the key and, as the corresponding value, the list of voters registered there:

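import csv
from collections import OrderedDict


# Map each (voting place, address) pair to the list of people who vote there.
# OrderedDict keeps the locations in order of first appearance, so a
# location's position in the dictionary can serve as its ID.
data = OrderedDict()
with open('input.txt') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        row = [e.strip() for e in row]
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        person   = row[0]
        location = (row[1], row[2])

        # Note: the header row is not skipped, so it becomes location 0.
        data.setdefault(location, []).append(person)

# Show voting places
print("Voting places (voting_place_id, voting_place):")
for i, k in enumerate(data):
    print("  %3d %s" % (i, k))
print("")

# Show voters, each tagged with the ID of their voting place
print("Voters (voting_place_id, person):")
for i, k in enumerate(data):
    for p in data[k]:
        print("  %3d %s" % (i, p))
print("")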

Output:

Voting places (voting_place_id, voting_place):
    0 ('Voting Place', 'VP address')
    1 ('Zoo', '123 fake street')
    2 ('Park', '814 Real Street')

Voters (voting_place_id, person):
    0 Person
    1 John Doe
    1 Jane Doe
    2 Joey Ramone
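
To address the single-pass part of the question directly, the same dictionary idea can write both files as it reads. This is only a minimal sketch, not the code from the answer: the function name `normalize`, the output column names, and the in-memory streams standing in for real files are assumptions made for illustration. It also assumes the first row is a header and that IDs start at 1.

```python
import csv
import io


def normalize(src, people_out, places_out):
    """Split pipe-delimited rows into a people file and a deduplicated places file."""
    reader = csv.reader(src, delimiter='|')
    people = csv.writer(people_out, lineterminator='\n')
    places = csv.writer(places_out, lineterminator='\n')
    people.writerow(['person', 'voting_place_id'])
    places.writerow(['voting_place_id', 'voting_place', 'address'])

    ids = {}  # (place, address) -> arbitrary ID
    next(reader, None)  # assumption: the first row is a header
    for row in reader:
        row = [e.strip() for e in row]
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        person, location = row[0], (row[1], row[2])
        if location not in ids:
            # First time this location is seen: assign an ID and write it out.
            ids[location] = len(ids) + 1
            places.writerow([ids[location], *location])
        people.writerow([person, ids[location]])
    return ids


sample = """Person | Voting Place | VP address
John Doe | Zoo | 123 fake street
Jane Doe | Zoo | 123 fake street
Joey Ramone | Park | 814 Real Street
"""
people_file, places_file = io.StringIO(), io.StringIO()
ids = normalize(io.StringIO(sample), people_file, places_file)
print(places_file.getvalue())
print(people_file.getvalue())
```

A dictionary is used here rather than a set because a set can only say whether a location has been seen, while a dictionary can also return the ID that was assigned to it. With real files, the handles would be opened with `newline=''`, as the csv module documentation recommends.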

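No state is saved between runs of this script, so if you run it once with half of the dataset and then again with the other half, the same "place IDs" will be reused, regardless of which IDs were handed out in the first run.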

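However, if you append data to the original data and run the program again, the IDs generated in the first run will match those generated in the second run, provided that none of the lines in which a location first appears has changed. This is why we use OrderedDict instead of dict (on Python 3.7 and later, a plain dict also preserves insertion order).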
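
The point about appended data can be checked with a small sketch. The helper `assign_ids` is hypothetical; it simply numbers locations in order of first appearance, which is what enumerating the dictionary keys does in the script above.

```python
def assign_ids(locations):
    """Number each distinct location in order of first appearance."""
    ids = {}
    for loc in locations:
        ids.setdefault(loc, len(ids))
    return ids


first_run = ['Zoo', 'Zoo', 'Park']
appended = first_run + ['Library', 'Zoo']   # new rows added at the end
reordered = ['Library'] + first_run         # a new row inserted at the top

ids_first = assign_ids(first_run)
ids_appended = assign_ids(appended)
ids_reordered = assign_ids(reordered)

# Appending keeps every earlier ID intact...
print(all(ids_appended[k] == v for k, v in ids_first.items()))   # True
# ...but changing where a location first appears shifts the IDs.
print(ids_reordered['Zoo'] == ids_first['Zoo'])                  # False
```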

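If you want persistent state, you can always pickle and unpickle the data dictionary between runs. Alternatively, dump the ordered keys to a file and use those keys to initialize the data dictionary on the next run.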
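
The second option, dumping the ordered keys, might look roughly like this. It is a sketch under assumptions: the helpers `load_ids` and `save_ids`, the file name `places.json`, and the sample locations are all made up for illustration, and it stores the keys as JSON rather than with pickle.

```python
import json
import os
import tempfile


def load_ids(path):
    """Rebuild the location -> ID mapping from a saved, ordered list of keys."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        keys = json.load(f)  # list of [place, address] pairs, in ID order
    return {tuple(k): i for i, k in enumerate(keys)}


def save_ids(path, ids):
    """Write the keys in ID order so the next run can restore the same IDs."""
    ordered = sorted(ids, key=ids.get)
    with open(path, 'w') as f:
        json.dump(ordered, f)


state = os.path.join(tempfile.mkdtemp(), 'places.json')  # hypothetical path

# First run sees two locations.
ids = load_ids(state)
for loc in [('Zoo', '123 fake street'), ('Park', '814 Real Street')]:
    ids.setdefault(loc, len(ids))
save_ids(state, ids)

# Second run sees a different part of the data, yet reuses the saved IDs.
ids = load_ids(state)
for loc in [('Library', '1 Main Street'), ('Zoo', '123 fake street')]:
    ids.setdefault(loc, len(ids))
save_ids(state, ids)
print(ids)
```

Because the saved keys are reloaded before any new rows are read, a location keeps the ID it was given in an earlier run even when the later input lists it in a different position.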

How to normalize a CSV list into 2 or more separate files?
