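I'm writing a script that is supposed to remove duplicate entries. Some people in the data entered their names twice because they have two phone numbers, and since the phone number field isn't an array, they created multiple entries to record more than one number.
My script converts the entries into dictionaries keyed by column name and then goes through each row. There is a main for loop that iterates over every row, and then a nested for loop that goes over all of the elements, comparing them to detect duplicates. When I hit a duplicate, my code is supposed to compare the phone, email, and website, and append them to the matching fields if they are unique / don't match.
The script runs, but the csv it returns is filled with the last person in the csv repeated 8 times, and nothing else.
Here is my code: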
import csv

# This function takes a tab-delim csv and merges the ones with the same name but different phone / email / websites.
def merge_duplicates(sheet):
    myjson = [] # myjson = list of dictionaries where each dictionary
    with(open("ieca_first_col_fake_text.txt", "rU")) as f:
        sheet = csv.DictReader(f,delimiter="\t")
        for row in sheet:
            myjson.append(row)
    write_file = csv.DictWriter(open('duplicates_deleted.csv','w'), ['name','phone','email','website'], restval='', delimiter = '\t')
    for row in myjson:
        # convert phone, email, and web to lists so that extra can be appended
        row['phone'] = row['phone'].split() if row.get('phone') else []
        row['email'] = row['email'].split() if row.get('email') else []
        row['website'] = row['website'].split() if row.get('website') else []
        print row
    i = 0
    for i in range(len(myjson)):
        # if the names match, check to see if phone, em, web match. If any match, append to first row.
        try:
            print 'trying'
            if myjson[i]['name'] == myjson[i+1]['name']:
                if myjson[i]['phone'] != myjson[i+1]['phone']:
                    print 'detected'
                    myjson[i]['phone'].append(myjson[i+1]['phone'])
                if myjson[i]['email'] != myjson[i+1]['email']:
                    myjson[i]['email'].append(myjson[i+1]['email'])
                if myjson[i]['website'] != myjson[i+1]['website']:
                    myjson[i]['website'].append(myjson[i+1]['website'])
        except IndexError:
            print("We're at the end now")
        write_file.writerow(row)
        print row

merge_duplicates('ieca_first_col_fake_text.txt')
Here is the csv output (not real people... made up!):
"Amy Tramy Lamy Ph.D. [] [] []"
"Amy Tramy Lamy Ph.D. [] [] []"
"Amy Tramy Lamy Ph.D. [] [] []"
"Amy Tramy Lamy Ph.D. [] [] []"
"Amy Tramy Lamy Ph.D. [] [] []"
"Amy Tramy Lamy Ph.D. [] [] []"
"Amy Tramy Lamy Ph.D. [] [] []"
"Amy Tramy Lamy Ph.D. [] [] []"
Thanks so much for your help!
Example data, in case it helps:
name phone email website
Diane Grant Albrecht M.S.
"Lannister G. Cersei M.A.T., CEP" 111-222-3333 cersei@got.com www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111 dman123@gmail.com www.daManWithThePlan.com
Sam D. Man Ed.M.
Sam D. Man Ed.M. 111-222-333 dman123@gmail.com www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
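Answer 0 (score: 2)
Your specific problem is that you are writing row to the output csv, but you never reassign it after using it in the for loop that builds the list of dictionaries, so it still refers to the last row every time it is written: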
write_file.writerow(row)
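To see the effect in isolation, here is a minimal sketch. The rows list and the written_buggy / written_fixed names are made up for illustration, and the snippet uses Python 3 syntax rather than the Python 2 prints in the question. After a for loop finishes, its loop variable keeps pointing at the last item, so a later index-based loop that still uses row repeats that item.

```python
rows = [{'name': 'A'}, {'name': 'B'}, {'name': 'C'}]

# First loop: when it ends, `row` still refers to the last dictionary.
for row in rows:
    pass

# Buggy pattern: the loop counts with i but reads the stale `row`.
written_buggy = []
for i in range(len(rows)):
    written_buggy.append(row['name'])

# Fixed pattern: read the element that i actually points at.
written_fixed = []
for i in range(len(rows)):
    written_fixed.append(rows[i]['name'])

print(written_buggy)  # ['C', 'C', 'C']
print(written_fixed)  # ['A', 'B', 'C']
```

In the question's script the equivalent change would be to write myjson[i] instead of row, although that alone would still leave the duplicate rows in the output.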
The code is a bit confusing. I think a simpler approach would be to use an OrderedDict keyed by name, assuming you are using 2.7 or later:
http://docs.python.org/2/library/collections.html#collections.OrderedDict
import csv
from collections import OrderedDict

people = OrderedDict()
with open("ieca_first_col_fake_text.txt", "rU") as f:
    sheet = csv.DictReader(f, delimiter="\t")
    for row in sheet:
        name = row.get('name')
        if name:
            contact_information = people.setdefault(name, {})
            # one set per field, so a repeated value is stored only once
            contact_information.setdefault('phone', set()).add(row.get('phone'))
            contact_information.setdefault('email', set()).add(row.get('email'))
            contact_information.setdefault('website', set()).add(row.get('website'))

write_file = csv.DictWriter(open('duplicates_deleted.csv', 'w'), ['name', 'phone', 'email', 'website'], restval='', delimiter='\t')
# iterate over items() to get (name, contact_information) pairs
for name, contact_information in people.items():
    # dict.update() returns None, so build the row first and update it afterwards
    row_dict = {'name': name}
    row_dict.update((field, list(values)) for field, values in contact_information.items())
    write_file.writerow(row_dict)
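This uses the Python set class to keep one copy of each phone number, email address, and website per unique name, then converts them to lists for writing to your CSV. It does not maintain order. Unfortunately there is no built-in OrderedSet, but if you want to preserve the order in which the values were seen, you could use another OrderedDict in place of the set.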
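The order-preserving variant could look roughly like the sketch below. It is only an illustration under a few assumptions: the merge_rows helper, the FIELDS list, and the in-memory sample are names invented here, blank cells are skipped, multiple values are joined with a space into one cell, and the code targets Python 3 (io.StringIO stands in for the real files).

```python
import csv
import io
from collections import OrderedDict

FIELDS = ['phone', 'email', 'website']

def merge_rows(rows):
    """Merge rows sharing a name, keeping unique values in first-seen order."""
    people = OrderedDict()
    for row in rows:
        name = row.get('name')
        if not name:
            continue
        # One inner OrderedDict per field; its keys act as an ordered set.
        info = people.setdefault(name, OrderedDict((f, OrderedDict()) for f in FIELDS))
        for field in FIELDS:
            value = row.get(field)
            if value:  # skip blank or missing cells
                info[field][value] = None
    return [
        dict([('name', name)] + [(f, ' '.join(info[f])) for f in FIELDS])
        for name, info in people.items()
    ]

# Hypothetical sample, loosely based on the example data in the question.
sample = (
    "name\tphone\temail\twebsite\n"
    "Sam D. Man Ed.M.\t000-000-1111\tdman123@gmail.com\twww.daManWithThePlan.com\n"
    "Sam D. Man Ed.M.\t\t\t\n"
    "Sam D. Man Ed.M.\t111-222-333\tdman123@gmail.com\twww.daManWithThePlan.com\n"
    "Amy Tramy Lamy Ph.D.\t\t\t\n"
)
merged = merge_rows(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Write the merged rows back out as tab-delimited text.
out = io.StringIO()
writer = csv.DictWriter(out, ['name'] + FIELDS, restval='', delimiter='\t')
writer.writeheader()
writer.writerows(merged)
```

Joining with a space mirrors the split() calls in the question's code, so a merged cell can be split back into its individual values later. A different separator would be needed if any value could itself contain a space.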