I'm trying to remove duplicate entries from data that looks like this:
name phone email website
Diane Grant Albrecht M.S.
Lannister G. Cersei M.A.T., CEP 111-222-3333 cersei@got.com www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111 dman123@gmail.com www.daManWithThePlan.com
Sam D. Man Ed.M.
Sam D. Man Ed.M. 111-222-333 dman123@gmail.com www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
so that it ends up looking like this:
name phone email website
Diane Grant Albrecht M.S.
Lannister G. Cersei M.A.T., CEP 111-222-3333 cersei@got.com www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111, 111-222-333 dman123@gmail.com www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
Here is my code:
from collections import defaultdict
import csv
import re
input = open('ieca_first_col_fake_text.txt', 'rU')
# default to empty set for phone, email, website, area, degrees
extracted_data = defaultdict(lambda: [set(), set(), set()])
for row in input:
    for index, value in enumerate(row):
        name = row[0]
        data = extracted_data[name].add(row)
for row in data: print row
I get this error:
AttributeError: 'list' object has no attribute 'add'
Update:
from collections import defaultdict
import csv
import re
input = open('ieca_first_col_fake_text.txt', 'rU')
input_r = csv.reader(input, delimiter = '\t')
# default to empty set for phone, email, website, area, degrees
extracted_data = defaultdict(lambda: [set(), set(), set()])
data = []
# Index on the name and then for that name add the rest of the information.
for row in input_r:
    data_set = extracted_data[row[0]]
    for index, value in enumerate(row[1:]):
        data_set[index].add(value)
print data_set
Output:
[set(['']), set(['']), set([''])]
Answer 0 (score: 3)
The values in extracted_data are lists, each holding 3 sets:
extracted_data = defaultdict(lambda: [set(), set(), set()])
You need to read the previous answer more carefully and pick the right set to call .add() on.
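To make the error concrete, here is a minimal sketch in Python 3 syntax. The key 'Sam' and the sample values are made up for illustration. It shows that each value in the defaultdict is a list, so .add() has to be called on one of the sets inside that list rather than on the list itself:

```python
from collections import defaultdict

# Each new key maps to a list of three sets: phone, email, website.
extracted_data = defaultdict(lambda: [set(), set(), set()])

try:
    # The value is a list, and lists have no .add() method.
    extracted_data['Sam'].add('000-000-1111')
except AttributeError as exc:
    error_message = str(exc)

# Pick one of the sets by index first, then call .add() on that set.
extracted_data['Sam'][0].add('000-000-1111')
extracted_data['Sam'][0].add('111-222-333')
extracted_data['Sam'][1].add('dman123@gmail.com')
```

Adding the same value to a set twice is harmless, which is what makes sets a convenient fit for collapsing duplicate rows.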
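The previous answer loops over the 4 elements of an input line, uses the first element to look up the list of sets, and adds each of the other 3 elements to the corresponding set: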
for index, value in enumerate(split(entry)):
    if index == 0:
        data_set = extracted_data[name]
    elif value:
        data_set[index - 1].add(value)
Personally, I would use:
entry = entry.split() # split on whitespace
for value, dset in zip(entry[1:], extracted_data[entry[0]]):
    dset.add(value)
to achieve the same thing.
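Putting the pieces together, the following is one possible end-to-end sketch in Python 3, not a definitive implementation. It assumes the file is tab-delimited (as the csv.reader call in the question's update suggests), because names contain spaces and splitting on whitespace would break them apart. It also swaps the sets for lists so that values keep their first-seen order, and skips empty fields so that '' never ends up in the result. SAMPLE and merge_rows are hypothetical names used only for this illustration:

```python
import csv
import io
from collections import OrderedDict

# Hypothetical stand-in for the tab-separated input file from the question.
SAMPLE = (
    "name\tphone\temail\twebsite\n"
    "Diane Grant Albrecht M.S.\t\t\t\n"
    "Sam D. Man Ed.M.\t000-000-1111\tdman123@gmail.com\twww.daManWithThePlan.com\n"
    "Sam D. Man Ed.M.\t\t\t\n"
    "Sam D. Man Ed.M.\t111-222-333\tdman123@gmail.com\twww.daManWithThePlan.com\n"
)


def merge_rows(lines):
    """Merge rows that share a name, keeping each distinct value once."""
    reader = csv.reader(lines, delimiter='\t')
    header = next(reader)
    merged = OrderedDict()
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # One list per remaining column, created the first time a name is seen.
        columns = merged.setdefault(row[0], [[] for _ in header[1:]])
        for values, value in zip(columns, row[1:]):
            # Skip empty fields and values already recorded for this name.
            if value and value not in values:
                values.append(value)
    return header, merged


header, merged = merge_rows(io.StringIO(SAMPLE))
rows = [[name] + [', '.join(values) for values in columns]
        for name, columns in merged.items()]
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by an open file object, and rows could be written back out with csv.writer using the same delimiter.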