Question

Given a list of file paths:

my_file_list = ['a.txt','b.txt','c.txt','d.txt']

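I want to compare each file against the rest of the files in the list and then remove the paths of any duplicate files.

So if b.txt is identical to c.txt, my list should become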

my_file_list = ['a.txt','b.txt','d.txt']

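The only other challenge here is that all 4 files are inside a single zip file; let's call it files.zip.

So, is it better to navigate inside the zip file, access each file, and run filecmp, or to just extract the text from each file, compare the text, and then identify and remove the duplicates?

What is the most efficient way to do this in Python 3?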

Answer 1

#!/usr/bin/python3
# Python 3.6.1

import os
import filecmp

location = '.'
my_file_list = []

for filename in os.listdir(location):
    if filename.endswith('.txt'):
        my_file_list.append(filename)

print(my_file_list)
# ['b.txt', 'a.txt', 'c.txt', 'd.txt']

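# Keep only the first file of each group of identical files.
# shallow=False compares contents instead of trusting os.stat() signatures.
unique_files = []
for path in my_file_list:
    if not any(filecmp.cmp(path, kept, shallow=False) for kept in unique_files):
        unique_files.append(path)
my_file_list = unique_files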

print(my_file_list)
# ['b.txt', 'a.txt', 'd.txt']

Alternative code:

#!/usr/bin/python3
# Python 3.6.1

import os
import filecmp

location = '.'
my_file_list = []

# Retrieve the files from the specified location
for filename in os.listdir(location):
    if filename.endswith('.txt'):
        my_file_list.append(filename)

# Sort the files 
my_file_list.sort()

print(my_file_list)
# ['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt', 'f.txt']
#  b.txt and c.txt are duplicates of each other, and
#  d.txt and e.txt are also identical

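# Remove duplicates from my_file_list, keeping the first file of each group
i = 0
while i < len(my_file_list):
    j = i + 1
    while j < len(my_file_list):
        # shallow=False compares contents instead of trusting os.stat() signatures
        if filecmp.cmp(my_file_list[i], my_file_list[j], shallow=False):
            my_file_list.pop(j)  # do not advance j: the next file moved into slot j
        else:
            j += 1
    i += 1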

print(my_file_list)
# Only the first file of each group of duplicates is kept
# ['a.txt', 'b.txt', 'd.txt', 'f.txt']

Answer 2

Using a set is appropriate here. https://docs.python.org/3/tutorial/datastructures.html#sets
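
As a minimal sketch of how a set could handle the zip case from the question: hash the bytes of each member and keep a path only if its digest has not been seen yet. The helper name unique_members and the choice of SHA-256 are assumptions made for illustration, not something given in the question or the answers; zipfile.ZipFile.read and hashlib.sha256 are standard-library calls.

```python
import hashlib
import zipfile


def unique_members(zip_path, names):
    """Return names in their original order, dropping members whose content repeats.

    unique_members is a hypothetical helper written for this example.
    """
    seen_digests = set()  # digests of the contents that were already kept
    unique = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in names:
            # Read the member straight from the archive; nothing is extracted to disk.
            digest = hashlib.sha256(archive.read(name)).digest()
            if digest not in seen_digests:
                seen_digests.add(digest)
                unique.append(name)
    return unique


# Example call (assumes files.zip sits in the working directory):
# my_file_list = unique_members('files.zip', ['a.txt', 'b.txt', 'c.txt', 'd.txt'])
```

Each file is read and hashed once, so the work grows with the number of files rather than with the number of file pairs, which is where the nested filecmp loops spend their time. For very large members, archive.open(name) could be read in chunks and fed to the hash incrementally instead of loading everything with archive.read(name).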
