I want to modify the email field in a CSV file, for example mycsv_file.csv:
john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john@gmail.com
mary@gmail.com
klarck@gmail.com
The code that reads the CSV file:
import csv

with open('mycsv_file.csv', 'r') as csv_file:
    spamreader = csv.reader(csv_file)
    for line in spamreader:
        ord = next(spamreader)
        for k in ord:
            if line[0] == k[0]:
                line[0] = ????  # what should go here?
The result I want:
john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john1@gmail.com
mary1@gmail.com
klarck1@gmail.com
Answer 0 (score: 2)
I would keep track of known addresses in a dictionary and, whenever an address has been seen before, append a number to it.
import csv

addresses = []  # ["user@host.com"]
known_addresses = {}  # {"user@host.com": 0}

with open('mycsv_file.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    for line in reader:
        address = line[0]
        if address in known_addresses:
            # Seen before: bump the counter and insert it before the '@'.
            known_addresses[address] += 1
            email, host = address.split("@")
            number = str(known_addresses[address])
            address = email + number + '@' + host
        else:
            known_addresses[address] = 0
        # Record every row, renamed or not.
        addresses.append(address)
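However, it cannot know whether an incremented address will appear later in the list, so duplicates may still remain.
For example, if your list is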
mary@gmail.com
mary@gmail.com
mary1@gmail.com
you get the output
mary@gmail.com
mary1@gmail.com
mary1@gmail.com
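If you want to make sure that all addresses are unique after processing, without losing any address from the original set, you can read all the addresses first and then process them, incrementing any duplicates.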
import csv

# all addresses read from the file, with the number of times each occurs
addresses = {}  # {"user@host.com": 1}
# addresses after duplicates have been renamed
processed_addresses = set()

with open('mycsv_file.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    for line in reader:
        address = line[0]
        if address in addresses:
            addresses[address] += 1
        else:
            addresses[address] = 1

for address, count in addresses.items():  # .iteritems() if python 2.7
    num = 1
    for _ in range(count):
        if address not in processed_addresses:
            processed_addresses.add(address)
        else:
            parts = address.split('@')
            added = False
            # Try increasing suffixes until an unused address is found.
            while not added:
                tentative_address = parts[0] + str(num) + '@' + parts[1]
                if tentative_address not in processed_addresses:
                    processed_addresses.add(tentative_address)
                    added = True
                num += 1
Given the input
mary@gmail.com
mary@gmail.com
mary1@gmail.com
this produces
mary@gmail.com
mary1@gmail.com
mary11@gmail.com
If you need a list of addresses, you can convert the set of processed entries to a list as follows.
addresses = list(processed_addresses)
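A set does not keep the rows in their original order, so the list built from it may come out shuffled. If order matters, one possible variant is sketched below. It assumes the addresses have already been read into a list, and the function name make_unique is hypothetical rather than part of the answer above. It reserves every address present in the input so that a generated name cannot collide with one that appears later.

```python
def make_unique(addresses):
    """Return addresses in input order, renaming repeats so all are unique."""
    # Every address in the input is reserved, so a generated name can
    # never clash with an address that shows up later in the list.
    reserved = set(addresses)
    emitted = set()
    next_suffix = {}  # address -> next number to try for that address
    result = []
    for address in addresses:
        if address not in emitted:
            # First occurrence: keep it unchanged.
            emitted.add(address)
            result.append(address)
            continue
        local, _, host = address.partition('@')
        num = next_suffix.get(address, 1)
        candidate = '{}{}@{}'.format(local, num, host)
        # Skip suffixes that are already taken.
        while candidate in reserved or candidate in emitted:
            num += 1
            candidate = '{}{}@{}'.format(local, num, host)
        next_suffix[address] = num + 1
        emitted.add(candidate)
        result.append(candidate)
    return result
```

With the three-line input above, the second mary@gmail.com skips the reserved mary1@gmail.com and becomes mary2@gmail.com, while the original mary1@gmail.com is left untouched.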
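Answer 1 (score: 2)
You can use collections.Counter to keep track of how many times each email address has been seen so far, which tells you what number to append as a suffix to make it unique. To illustrate this, I added one line to the end of the sample input, which is now: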
john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john@gmail.com
mary@gmail.com
klarck@gmail.com
mary@gmail.com,third occurrence
Here is the code:
import csv
from collections import Counter

# Note: For Python 2.x, use "open('mycsv_file.csv', 'rb')" below.
with open('mycsv_file.csv', 'r', newline='') as csv_file:
    occurrences = Counter()
    for line in csv.reader(csv_file):
        email = line[0]
        if email in occurrences:
            head, tail = email.split('@')
            print('{}@{}'.format(head+str(occurrences[email]), tail))
            occurrences[email] += 1
        else:
            print('{}'.format(email))
            occurrences[email] = 1
Output (note the mary2@gmail.com generated at the end, because that address had already been seen twice):
john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john1@gmail.com
mary1@gmail.com
klarck1@gmail.com
mary2@gmail.com
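If the renamed addresses need to be collected rather than printed, the same counting can be wrapped in a generator. This is only a sketch: the name number_duplicates is hypothetical, and it relies on the fact that a Counter returns 0 for a key it has not seen, which removes the need for a separate first-occurrence branch.

```python
from collections import Counter

def number_duplicates(emails):
    """Yield each email, suffixing repeats with how often it was seen before."""
    seen = Counter()
    for email in emails:
        count = seen[email]  # a missing key reads as 0
        if count:
            head, tail = email.split('@')
            yield '{}{}@{}'.format(head, count, tail)
        else:
            yield email
        seen[email] += 1
```

For example, list(number_duplicates(line[0] for line in csv.reader(csv_file))) would gather the results for the file read above.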
Answer 2 (score: 1)
Read, check, and write to a new file in a single loop.
O(nm)
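The answer gives no code, so the following is only one way the single-loop idea might look. The function name rewrite_emails and both path parameters are hypothetical. This sketch does the check with a dictionary, so each lookup takes constant time on average instead of rescanning earlier rows.

```python
import csv

def rewrite_emails(src_path, dst_path):
    """Copy src_path to dst_path, numbering repeated addresses in column 0."""
    seen = {}  # address -> number of times seen so far
    with open(src_path, 'r', newline='') as src, \
         open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            address = row[0]
            count = seen.get(address, 0)
            seen[address] = count + 1
            if count:
                # Repeat: insert the count before the '@'.
                head, tail = address.split('@')
                row[0] = '{}{}@{}'.format(head, count, tail)
            writer.writerow(row)
```

Calling rewrite_emails('mycsv_file.csv', 'mycsv_file_new.csv') leaves the original file untouched; the output file name here is an arbitrary choice. As with the first answer's approach, a generated name could still collide with an address that already exists in the file.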