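I am building a dictionary of triplets from a CSV file: each key is a line number and each value is a list of three integers. I am also building a second dictionary (names) whose keys are line numbers and whose values are lists of two strings. I want to find all lines that contain the same triplet but a different pair of names.
So far, my code finds the duplicates when the same triplet appears on two lines, but it does not work correctly when the triplet appears on 3 or more lines. I want to update or rewrite the whole script so that, when there are 3 or more duplicates, it checks whether all the name values differ and prints only the lines whose names differ. For example, given the following triplet dictionary:
triplet = {1: [111, 222, 333], 2: [111, 222, 333], 3: [111, 222, 333]}
and names = {1: ['name1', 'name2'], 2: ['name1', 'name2'], 3: ['name1', 'name3']}
the script builds another dictionary: duplicated_value_keys = {(111, 222, 333): [1, 2, 3]}
and, since names[1] == names[2], my script does not report the duplicate, although in principle it should print that the names on lines 2 and 3 of that triplet differ.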
import collections
import os

for csv_infile in os.listdir(input_dir):
    if csv_infile.lower().endswith('.csv'):
        csv_in = os.path.join(input_dir, csv_infile)
        with open(csv_in) as f_in:
            # Build two dictionaries keyed by line number: one holds the
            # triplet, the other holds the pair of names.
            triplet = {}
            names = {}
            l_num = 0
            for line in f_in:
                l_num += 1
                fields = line.split('\t')
                triplet[l_num] = [fields[1], fields[2], fields[3]]
                names[l_num] = [fields[4].lower().strip(), fields[5].lower().strip()]
        # Find the duplicated triplets: map each triplet to the line numbers it appears on.
        duplicated_value_keys = collections.defaultdict(list)
        for key, value in triplet.items():
            duplicated_value_keys[tuple(value)].append(key)
        for duplicated_keys in duplicated_value_keys.values():
            # Only the first two line numbers are compared here.
            if len(duplicated_keys) > 1 and names[duplicated_keys[0]] != names[duplicated_keys[1]]:
                print("There is a duplicated triplet on lines: {}.\n".format(', '.join(map(str, duplicated_keys))))
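The failure described above can be reproduced with the example dictionaries from the question. The following is a minimal sketch, not the asker's script: `has_name_conflict` is a hypothetical helper introduced here for illustration. It compares the original condition, which only looks at the first two line numbers, with a check that collects every name pair of the group into a set.

```python
from collections import defaultdict

# Example dictionaries from the question (keys are line numbers).
triplet = {1: [111, 222, 333], 2: [111, 222, 333], 3: [111, 222, 333]}
names = {1: ['name1', 'name2'], 2: ['name1', 'name2'], 3: ['name1', 'name3']}

# Group line numbers by triplet value, as in the script above.
duplicated_value_keys = defaultdict(list)
for key, value in triplet.items():
    duplicated_value_keys[tuple(value)].append(key)


def has_name_conflict(line_numbers, names):
    """Return True if the given lines do not all share the same name pair.

    Hypothetical helper: lists are converted to tuples so they can be
    stored in a set.
    """
    distinct = {tuple(names[n]) for n in line_numbers}
    return len(distinct) > 1


for lines in duplicated_value_keys.values():
    # Original condition: compares only the first two lines of the group.
    first_two_differ = len(lines) > 1 and names[lines[0]] != names[lines[1]]
    # Set-based condition: considers every line of the group.
    print(lines, first_two_differ, has_name_conflict(lines, names))
    # prints: [1, 2, 3] False True
```

Because lines 1 and 2 carry identical names, the original condition is false for the whole group, even though line 3 differs; the set-based check does not depend on the order of the line numbers.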
[EDIT]: The CSV input file is tab-separated and has the following format:
2 8004 3014 3 test name 1 14080 1 0 3478 1572 0 0
2 8004 3014 3 test name 1 8004 1 0 3478 1572 0 0
3 8004 3014 3 test name1 1 8004 1 0 3477 1571 0 0
Answer 0 (score: 1)
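You can use a defaultdict(list) to detect the duplicated lines. Each triplet becomes a key of the dictionary, and its value is a list of the line numbers and names where that triplet was found. After all entries have been read, iterate over the dictionary and display only those entries that contain differing names. For example: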
import csv
from collections import defaultdict

# Map each triplet to a list of (line number, [name, name]) entries.
triplets = defaultdict(list)

with open('test.csv', newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter='\t')
    for line, row in enumerate(csv_input, start=1):
        triplets[tuple(row[1:4])].append((line, list(map(str.lower, row[4:6]))))

for triplet, entries in sorted(triplets.items()):
    # Report a triplet only if it occurs more than once with at least two distinct name pairs.
    if len(entries) > 1 and len({tuple(names) for line, names in entries}) > 1:
        print("Duplicate triplet: {} on lines:".format(triplet))
        for line, names in entries:
            print(" {}, {}, {}".format(line, *names))
        print()
For the given test.csv, this would produce:
Duplicate triplet: ('13115', '3209', '3') on lines:
44, skylink, horor film
69, skylink, private spice
Duplicate triplet: ('13139', '3219', '3') on lines:
8, skylink, nova cinema
13, skylink, prima zoom
Duplicate triplet: ('8004', '3014', '3') on lines:
2, skylink, ct 2
3, skylink, bar 2
4, skylink, tst 22
5, skylink, tst 22
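To try the same idea without a file on disk, the grouping can be wrapped in a function that reads from any iterable of lines. This is only a sketch under stated assumptions: `conflicting_triplets` is a hypothetical name, the sample rows are invented to mirror the question's example, and names are stripped as well as lower-cased (as in the question's script, unlike the answer's code). Instead of printing, it returns the line numbers grouped by name pair, which shows directly which lines agree and which differ.

```python
import csv
import io
from collections import defaultdict


def conflicting_triplets(lines):
    """Map each triplet to {name pair: [line numbers]} when names disagree."""
    groups = defaultdict(lambda: defaultdict(list))
    reader = csv.reader(lines, delimiter='\t')
    for line_no, row in enumerate(reader, start=1):
        # Normalise the two name columns (assumption: columns 4 and 5).
        name_pair = tuple(field.lower().strip() for field in row[4:6])
        groups[tuple(row[1:4])][name_pair].append(line_no)
    # Keep only triplets that occur with more than one distinct name pair.
    return {t: dict(by_name) for t, by_name in groups.items() if len(by_name) > 1}


# Hypothetical tab-separated rows modelled on the question's example.
sample = io.StringIO(
    "a\t111\t222\t333\tname1\tname2\n"
    "b\t111\t222\t333\tName1\tname2 \n"
    "c\t111\t222\t333\tname1\tname3\n"
    "d\t444\t555\t666\tname1\tname2\n"
)
result = conflicting_triplets(sample)
for triplet, by_name in result.items():
    print(triplet, by_name)
```

With these rows, only the triplet ('111', '222', '333') is reported: lines 1 and 2 share one name pair and line 3 has another, while the triplet on line 4 appears only once and is dropped.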