How do I modify duplicate fields in a CSV file?

Asked: 2017-12-22 22:42:59

Tags: python python-3.x python-2.7

I want to modify the email field in a CSV file, for example mycsv_file.csv:

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john@gmail.com
mary@gmail.com
klarck@gmail.com

My code for reading the CSV file:

import csv

with open('mycsv_file.csv', 'r') as csv_file: 
     spamreader = csv.reader(csv_file)
     for line in spamreader:
         ord = next.spamreader
         for k in ored:       
            if line[0]==k[0]:
               line[0]==????

The result I want:

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john1@gmail.com
mary1@gmail.com
klarck1@gmail.com

3 Answers:

Answer 0 (score: 2)

I would keep track of known addresses in a dictionary and, whenever an address has been seen before, append a number to it.

import csv

addresses = []  # processed addresses, e.g. ["user@host.com"]
known_addresses = {}  # duplicates seen so far, e.g. {"user@host.com": 0}

with open('mycsv_file.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    for line in reader:
        address = line[0]
        if address in known_addresses:
            known_addresses[address] += 1
            email, host = address.split("@")
            number = str(known_addresses[address])
            address = email + number + '@' + host
        else:
            known_addresses[address] = 0
        addresses.append(address)

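However, it does not know whether an already-incremented address appears later in the list, so duplicates may still remain.

For example, if your list is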

mary@gmail.com
mary@gmail.com
mary1@gmail.com

you get the output

mary@gmail.com
mary1@gmail.com
mary1@gmail.com

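If you want to be sure that all addresses are unique after processing, without losing any address from the original set, you can read all the addresses first and then process them, incrementing any duplicates.

import csv

# all addresses read from the file, with the number of times each occurs
addresses = {}  # { "user@host.com": 1 }

# addresses after duplicates have been renamed
processed_addresses = set()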

with open('mycsv_file.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    for line in reader:
        address = line[0]
        if address in addresses:
            addresses[address] += 1
        else:
            addresses[address] = 1

for address, count in addresses.items(): # .iteritems() if python 2.7
    num = 1
    for _ in range(count):
        if address not in processed_addresses:
            processed_addresses.add(address)
        else:
            parts = address.split('@')
            added = False
            while not added:
                tentative_address = parts[0] + str(num) + '@' + parts[1]
                if tentative_address not in processed_addresses:
                    processed_addresses.add(tentative_address)
                    added = True
                num += 1

Given the input

mary@gmail.com
mary@gmail.com
mary1@gmail.com

this produces

mary@gmail.com
mary1@gmail.com
mary11@gmail.com

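If you need a list of addresses, you can convert the set of processed entries to a list as follows. Because a set is unordered, the resulting list does not preserve the order of the input file.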

addresses = list(processed_addresses)
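If the original row order matters, one possible variation is sketched below. It is not the code from this answer: the function name make_unique and its internal names are hypothetical, and it assumes every entry contains an '@'. It reserves every spelling that appears in the input so that a generated address never collides with an address that occurs later, which means its output differs from the set-based version above (it yields mary2 rather than mary11 for the example input).

```python
def make_unique(addresses):
    """Return the addresses in their original order, adding a numeric
    suffix before the '@' to any entry that would otherwise repeat."""
    reserved = set(addresses)  # every spelling present in the input
    used = set()               # addresses already placed in the result
    next_suffix = {}           # address -> next suffix number to try
    result = []
    for address in addresses:
        if address not in used:
            candidate = address
        else:
            local, _, host = address.rpartition('@')
            num = next_suffix.get(address, 1)
            candidate = '{}{}@{}'.format(local, num, host)
            # Skip suffixes that are taken or that appear in the input.
            while candidate in used or candidate in reserved:
                num += 1
                candidate = '{}{}@{}'.format(local, num, host)
            next_suffix[address] = num + 1
        used.add(candidate)
        result.append(candidate)
    return result


print(make_unique(['mary@gmail.com', 'mary@gmail.com', 'mary1@gmail.com']))
# ['mary@gmail.com', 'mary2@gmail.com', 'mary1@gmail.com']
```

The list of addresses could be built by collecting line[0] from csv.reader, as in the loops above.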

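Answer 1 (score: 2)

You can use collections.Counter to keep track of how many times each email address has been seen so far, which tells you what number to append as a suffix to make it unique. To illustrate this, I added a line to the end of the sample input, which is now: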

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john@gmail.com
mary@gmail.com
klarck@gmail.com
mary@gmail.com,third occurrence

Here is the code:

import csv
from collections import Counter

# Note: For Python 2.x, use "open('mycsv_file.csv', 'rb')" below.
with open('mycsv_file.csv', 'r', newline='') as csv_file:
    occurrences = Counter()
    for line in csv.reader(csv_file):
        email = line[0]
        if email in occurrences:
            head, tail = email.split('@')
            print('{}@{}'.format(head+str(occurrences[email]), tail))
            occurrences[email] += 1
        else:
            print('{}'.format(email))
            occurrences[email] = 1

Output (note the mary2@gmail.com generated at the end, because that address had already been seen twice):

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john1@gmail.com
mary1@gmail.com
klarck1@gmail.com
mary2@gmail.com

Answer 2 (score: 1)

Read, check, and write to a new file, all in a single loop.

O(nm)
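This answer gives no code, so the following is only a minimal sketch of one way to read, check, and write in a single loop. It targets Python 3; the function name dedupe_emails and the output file name are hypothetical, and it assumes the email is in the first column and contains an '@'. It looks each address up in a Counter instead of rescanning earlier rows, so each check takes constant time on average. Like the Counter approach in Answer 1, it does not guard against a generated address colliding with one that already exists in the file.

```python
import csv
from collections import Counter


def dedupe_emails(src_path, dst_path):
    """Copy src_path to dst_path, suffixing repeated first-column emails."""
    seen = Counter()  # email -> number of times seen so far
    with open(src_path, 'r', newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            email = row[0]
            count = seen[email]
            if count:
                # Seen before: insert the count before the '@'.
                local, _, host = email.rpartition('@')
                row[0] = '{}{}@{}'.format(local, count, host)
            seen[email] += 1
            writer.writerow(row)  # other columns are copied unchanged
```

A call such as dedupe_emails('mycsv_file.csv', 'mycsv_file_unique.csv') would leave the original file untouched and write the renamed addresses to the new file.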