Question

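I am building a dictionary of triplets from a CSV file, where each key is a line number and each value is a list of three integers. I am also building a second dictionary (names) whose keys are line numbers and whose values are lists of two strings. I want to find all lines that contain the same triplet but a different pair of names.

So far, my code finds the duplicates when the same triplet appears on two lines, but it does not work properly when the triplet is repeated on 3 or more lines. I would like to update or rewrite the whole script so that, with 3 or more duplicates, it checks whether the name values differ and prints only the lines whose names are different. For example, given the dictionaries triplet = {1: [111, 222, 333], 2: [111, 222, 333], 3: [111, 222, 333]} and names = {1: ['name1', 'name2'], 2: ['name1', 'name2'], 3: ['name1', 'name3']}, the script creates another dictionary: duplicated_value_keys = {(111, 222, 333): [1, 2, 3]}. Because names[1] == names[2], my script does not report the duplicate, but in principle it should print that the names differ on lines 2 and 3 for that triplet.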

import collections
import os

# input_dir is assumed to be defined earlier in the script.
for csv_infile in os.listdir(input_dir):
    if csv_infile.lower().endswith('.csv'):
        csv_in = os.path.join(input_dir, csv_infile)
        with open(csv_in) as f_in:
            # Build two dictionaries keyed by line number:
            # triplet holds columns 1-3, names holds columns 4-5 (lowercased, stripped).
            triplet = {}
            names = {}
            for l_num, line in enumerate(f_in, start=1):
                fields = line.split('\t')
                triplet[l_num] = fields[1:4]
                names[l_num] = [fields[4].lower().strip(), fields[5].lower().strip()]

            # Group the line numbers by triplet value.
            duplicated_value_keys = collections.defaultdict(list)
            for key, value in triplet.items():
                duplicated_value_keys[tuple(value)].append(key)
            for duplicated_keys in duplicated_value_keys.values():
                # Only the first two lines of each group are compared here,
                # which is why groups of 3 or more lines are handled incorrectly.
                if len(duplicated_keys) > 1 and names[duplicated_keys[0]] != names[duplicated_keys[1]]:
                    print("There is a duplicated triplet on lines: {}.\n".format(', '.join(map(str, duplicated_keys))))

[EDIT]: The CSV input file has the following format and is tab-separated:

2       8004    3014    3       test name   1       14080   1       0       3478    1572    0       0
2       8004    3014    3       test name    1       8004    1       0       3478    1572    0       0
3       8004    3014    3       test name1   1       8004    1       0       3477    1571    0       0

Answer 1

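You can use a defaultdict(list) to detect the duplicated lines. Each triplet becomes a dictionary key, and its value is a list of (line number, names) pairs for the lines on which that triplet was found. After all the entries have been read, iterate over the dictionary and display only those entries that contain differing names. For example: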

import csv
from collections import defaultdict

triplets = defaultdict(list)

with open('test.csv', newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter='\t')

    for line, row in enumerate(csv_input, start=1):
        triplets[tuple(row[1:4])].append((line, list(map(str.lower, row[4:6]))))

for triplet, entries in sorted(triplets.items()):
    if len(entries) > 1 and len({tuple(names) for line, names in entries}) > 1:
        print("Duplicate triplet: {} on lines:".format(triplet))
        for line, names in entries:
            print("  {}, {}, {}".format(line, *names))
        print()

For the given test.csv, this produces:

Duplicate triplet: ('13115', '3209', '3') on lines:
  44, skylink, horor film
  69, skylink, private spice

Duplicate triplet: ('13139', '3219', '3') on lines:
  8, skylink, nova cinema
  13, skylink, prima zoom

Duplicate triplet: ('8004', '3014', '3') on lines:
  2, skylink, ct 2
  3, skylink, bar 2
  4, skylink, tst 22
  5, skylink, tst 22
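
As a quick check against the example dictionaries from the question, the same grouping idea can be sketched on in-memory data, without reading a file. This is only an illustration: the function name find_conflicting_triplets is hypothetical and does not appear in the original script, and it assumes every line number in triplet also exists in names.

```python
from collections import defaultdict


def find_conflicting_triplets(triplet, names):
    """Return {triplet: [line numbers]} for triplets whose names are not all equal.

    Hypothetical helper: assumes both dicts are keyed by the same line numbers.
    """
    # Group the line numbers by triplet value.
    lines_by_triplet = defaultdict(list)
    for line_no, values in triplet.items():
        lines_by_triplet[tuple(values)].append(line_no)

    conflicts = {}
    for key, line_nos in lines_by_triplet.items():
        # A set of name pairs has more than one element only if some names differ,
        # no matter how many lines share the triplet.
        distinct_names = {tuple(names[n]) for n in line_nos}
        if len(distinct_names) > 1:
            conflicts[key] = sorted(line_nos)
    return conflicts


# Example data taken from the question.
triplet = {1: [111, 222, 333], 2: [111, 222, 333], 3: [111, 222, 333]}
names = {1: ['name1', 'name2'], 2: ['name1', 'name2'], 3: ['name1', 'name3']}

result = find_conflicting_triplets(triplet, names)
print(result)  # {(111, 222, 333): [1, 2, 3]}
```

Comparing the whole set of name pairs, rather than only the first two entries, is what makes the check work for groups of three or more lines.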
