Question

I have 20,000+ files like the ones below, all in the same directory:

8003825.pdf
8003825.tif
8006826.tif

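How can I find all duplicate filenames, ignoring the file extension?

Clarification: by "duplicate" I mean files that have the same filename once the extension is ignored. I don't care whether the files are 100% identical (e.g. same hash, size, or anything like that).

For example: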

"8003825" appears twice

Then look at the metadata of each duplicate and keep only the newest file.

Similar to this post:

Keep latest file and delete all other

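I think I'd have to build a list of all the files and check whether a given name already exists. If it does, use os.stat to determine the modification date?

I'm a bit worried about loading all those filenames into memory, and I'm wondering whether there's a more pythonic way of doing this...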

Python 2.6 Windows 7

Answer 1

You can do this in O(n) time. The solutions based on sorting are O(n*log(n)).

import os
from collections import namedtuple

directory = '.'  # placeholder: set this to the file directory
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry',['date','file_name'])

for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cashed_file is None:
        newest_files[name] = Entry(this_file_date,file_name)
    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)

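newest_files is a dictionary whose keys are the filenames without their extension and whose values are named tuples holding the file's full filename and its modification date. If a newly encountered file's name is already in the dictionary, its date is compared with the one stored there, and the entry is replaced if the new file is more recent.

In the end you have a dictionary containing the newest file for each name.

You can then use this dictionary to make a second pass. Note that a dictionary lookup is O(1), so looking up all n files in the dictionary is O(n) overall.

For example, if you want to keep only the newest file with a given name and delete the others, you can do it like this: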

for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file_name = newest_files.get(name).file_name
    if file_name != cashed_file_name: #it's not the newest with this name
        os.remove(file_name)

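As Blckknght suggested in the comments, you can even avoid the second pass and delete the older file as soon as both have been seen, by adding a couple of lines to the first loop. Note that the file currently being examined also has to be removed when it is the older of the two:

    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)
            os.remove(cashed_file.file_name) #this line added
        else:
            os.remove(file_name) #the current file is the older one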
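
For reference, here is a minimal self-contained sketch of that single-pass idea. The function name keep_newest is hypothetical, and two details are my own assumptions rather than part of the answer above: it builds full paths with os.path.join instead of calling os.chdir, and it skips anything that is not a regular file.

```python
import os


def keep_newest(directory):
    """Delete all but the newest file for each extension-less name.

    Hypothetical helper: returns a dict mapping each stem to the file kept.
    """
    newest = {}  # stem -> (mtime, file_name)
    for file_name in os.listdir(directory):
        path = os.path.join(directory, file_name)
        if not os.path.isfile(path):
            continue  # assumption: subdirectories are left alone
        stem = os.path.splitext(file_name)[0]
        mtime = os.path.getmtime(path)
        cached = newest.get(stem)
        if cached is None:
            newest[stem] = (mtime, file_name)
        elif mtime > cached[0]:
            # The current file is newer: drop the one seen earlier.
            os.remove(os.path.join(directory, cached[1]))
            newest[stem] = (mtime, file_name)
        else:
            # The current file is older (or the same age): drop it.
            os.remove(path)
    return dict((stem, entry[1]) for stem, entry in newest.items())
```

Since this deletes files, it is probably worth trying it on a copy of the directory first.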

Answer 2

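First, get a list of the filenames and sort it. That puts any duplicates next to each other.

Then strip the file extensions and compare each name with its neighbors; os.path.splitext() and itertools.groupby() may be useful here.

Once the duplicates are grouped, use os.stat() to pick which one to keep.

In the end, your code might look something like this: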

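import os, itertools

base_directory = '.'  # placeholder: the directory to scan

def stem(f):
    return os.path.splitext(f)[0]

files = os.listdir(base_directory)
files.sort(key=stem)  # sort by the grouping key so equal names are adjacent
for k, g in itertools.groupby(files, stem):
    dups = list(g)
    if len(dups) > 1:
        pass  # figure out which file(s) to remove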

You don't need to worry about memory here; you're looking at something on the order of a couple of megabytes.
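
One way to fill in the "figure out which file(s) to remove" step is sketched below. The function name stale_duplicates is hypothetical, and this version only returns the names it would delete rather than deleting anything, which is my own choice for safety. It sorts by the extension-less name (the same key used for grouping) so that names such as a.pdf, a.q.tif and a.tif cannot split a group.

```python
import itertools
import os


def stale_duplicates(directory):
    """Return the file names that are not the newest within their group."""
    def stem(file_name):
        return os.path.splitext(file_name)[0]

    def mtime(file_name):
        return os.stat(os.path.join(directory, file_name)).st_mtime

    to_remove = []
    files = sorted(os.listdir(directory), key=stem)
    for _, group in itertools.groupby(files, key=stem):
        dups = sorted(group, key=mtime)  # oldest first, newest last
        to_remove.extend(dups[:-1])      # everything except the newest
    return to_remove
```

The returned names can then be passed to os.remove (joined with the directory) once the list looks right.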

Answer 3

To count filenames, you can use a defaultdict to store how many times each name (without its extension) occurs:

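import os
from collections import defaultdict

file_names = os.listdir('.')  # placeholder: the file names or paths to check

counter = defaultdict(int)
for file_name in file_names:
    # strip any leading directory and the extension before counting
    file_name = os.path.splitext(os.path.basename(file_name))[0]
    counter[file_name] += 1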
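
To get from the counts to the list of duplicated names the question asks for, something along these lines should work. The function name duplicate_stems and the sample list are made up for illustration; the three names are the ones from the question.

```python
import os
from collections import defaultdict


def duplicate_stems(file_names):
    """Map each extension-less name that occurs more than once to its count."""
    counter = defaultdict(int)
    for file_name in file_names:
        counter[os.path.splitext(os.path.basename(file_name))[0]] += 1
    return dict((stem, n) for stem, n in counter.items() if n > 1)


names = ["8003825.pdf", "8003825.tif", "8006826.tif"]
for stem, n in sorted(duplicate_stems(names).items()):
    print("%s appears %d times" % (stem, n))
# prints: 8003825 appears 2 times
```

This only reports the duplicates; choosing which copy to keep still needs the modification dates.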

Find duplicate filenames and keep only the newest file, using Python
