How do I copy the first 100 files from a directory of thousands of files using Python?

Time: 2015-07-15 09:28:32

Tags: python

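I have a huge directory that is constantly being updated. I am trying to list only the latest 100 files in the directory using Python. I tried os.listdir(), but once the directory grows to around 100,000 files, listdir() seems to crash (or I am not waiting long enough). I only need the first 100 files (or file names) for further processing, so I don't want listdir() to be populated with all 100,000 files. Is there a good way to do this in Python?

PS: I am new to programming.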

1 answer:

Answer 0: (score: 1)

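Here is an answer on how to traverse a large directory file by file!

I searched like a maniac for a Windows DLL that would let me do what can be done on Linux, but had no luck.

So I concluded that the only way was to create my own DLL that would expose those static functions to me, but then I remembered pywintypes. And, yes! It has already been done there. What's more, an iterator function is already implemented! Cool!

A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be out there somewhere, but I didn't find it. So I used pywintypes.

EDIT: I found out (much too late) that these functions are available from kernel32.dll. They were hiding right in front of my nose the whole time.

Sorry for the dependency. But I think you can extract win32file.pyd from the ...\site-packages\win32 folder, along with any dependencies it has, and distribute it with your program independently of win32types if you have to.

As you will see from the speed tests, returning a generator is very fast.

After this you will be able to go file by file and do whatever you want.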
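To tie this back to the question: once the listing is lazy, taking only the first 100 names is just a matter of stopping early. Below is a minimal sketch, assuming Python 3.6+ where os.scandir() is in the standard library and can be used as a context manager. The helper name first_names is made up for illustration and is not part of the listdir.py module in this answer. Note that "first" here means whatever order the OS hands the entries back in, not the newest ones.

```python
import os
from itertools import islice


def first_names(path, limit=100):
    """Return up to `limit` entry names from `path`, in the order the OS yields them."""
    with os.scandir(path) as entries:
        # islice() stops pulling from the iterator after `limit` entries,
        # so the rest of the directory is never read.
        return [entry.name for entry in islice(entries, limit)]
```

Calling first_names("mydir") would return at most 100 names; the same islice() call should work on the listdir() generator from this answer as well.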

NOTE: win32file.FindFilesIterator() returns the whole stat of the file/dir, so using my listdir() to get the name and then calling os.path.get*time() or os.path.is*() afterwards doesn't make sense. It is better to modify my listdir() to do those checks.

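Now, getting a complete solution to your problem still remains an issue.

The bad news for you is that it starts at whichever item in the directory it likes, and you cannot choose which one that will be. In my tests it always returned the directory sorted. (On Windows.)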
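Because the order cannot be chosen, getting the *latest* 100 files means looking at every entry at least once. What can be avoided is holding all of them in memory. The following is a rough sketch of one way to do that, assuming Python 3.6+ and os.scandir(); it swaps in heapq.nlargest() to keep only the n newest entries while streaming, which is my own suggestion and not something the listdir.py code does. The name newest_names is hypothetical.

```python
import heapq
import os


def newest_names(path, n=100):
    """Return the names of the `n` most recently modified files, newest first."""
    with os.scandir(path) as entries:
        files = (entry for entry in entries if entry.is_file())
        # nlargest() consumes the generator one entry at a time and keeps
        # at most `n` candidates in memory.
        newest = heapq.nlargest(n, files, key=lambda e: e.stat().st_mtime)
    return [entry.name for entry in newest]
```

This still costs one stat per file on platforms where the directory listing does not include the modification time, so on a directory of 100,000 files it will not be instant.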

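The half-good news is that on Windows you can use wildcards to control which files are listed. So, to use this on a constantly populating directory, you could mark new files with a version and do something like: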

from time import sleep
from listdir import listdir  # the listdir.py module given below

bunch = 1
while True:
    for file in listdir("mydir\\*bunch%i*" % bunch): print file
    sleep(5); bunch += 1

But you will have to design this very cleverly, otherwise files will arrive that you never find because they came in too late.
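The wildcard support above comes from FindFilesIterator() and is Windows-only. As a portable approximation, the names can be filtered lazily with fnmatch while iterating. This is only a sketch under the assumption of Python 3.6+ with os.scandir(); iter_matching is a made-up helper name, and the matching happens in Python rather than in the OS, so every entry is still visited.

```python
import fnmatch
import os


def iter_matching(path, pattern):
    """Lazily yield the names in `path` that match a shell-style `pattern`."""
    with os.scandir(path) as entries:
        for entry in entries:
            # fnmatch() understands *, ? and [seq], like the Windows wildcards
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name
```

For example, iter_matching("mydir", "*bunch1*") would play the role of listdir("mydir\\*bunch1*") in the loop above.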

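I don't know whether FindFilesIterator() will keep detecting new files as they arrive if you introduce a delay between loop iterations.

If it does, that could also be your solution.

You can always create an iterator in advance and then call its next() method to get the next file: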

from time import sleep
from listdir import listdir  # the listdir.py module given below

i = listdir(".")
while True:
    try: name = i.next()
    except StopIteration: sleep(1)
# This probably won't work as imagined though

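You could decide how long to wait for new files based on the size of the last file that arrived. That is a wild guess that all incoming files will be roughly the same size, give or take.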
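Since an exhausted iterator is unlikely to wake up again, a simpler fallback is to re-scan the directory from time to time and remember which names have already been handled. This is a different technique from reusing one iterator, and it re-reads the whole directory on every pass, so treat it as a sketch only. It assumes Python 3.6+ with os.scandir(), and new_names is a hypothetical helper.

```python
import os


def new_names(path, seen):
    """Return names in `path` that are not yet in `seen`, and add them to `seen`."""
    fresh = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.name not in seen:
                seen.add(entry.name)
                fresh.append(entry.name)
    return fresh
```

A caller would keep one set, for example seen = set(), and call new_names("mydir", seen) in a loop with a time.sleep() between passes, processing whatever comes back each time.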

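However, win32file offers you some functions that help you monitor a directory for changes, and I think that is your best bet.

In the speed tests you can also see that constructing a list from this iterator is slower than calling os.listdir(), but os.listdir() blocks and my listdir() doesn't. Its purpose is not to create lists of files anyway. Why this speed loss appears, I don't know. I can only guess it has something to do with DLL calls, list construction, sorting or something like that. os.listdir() is written entirely in C.

You can see some usage in the if __name__ == "__main__" block. Save the code in listdir.py and use 'from listdir import *'.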

Here is the code:


#! /usr/bin/env python

"""
An equivalent of os.listdir() but as a generator using ctypes on 
Unixoides and pywintypes on Windows.

On Linux there is shared object libc.so that contains file manipulation 
functions we need: opendir(), readdir() and closedir().
On Windows those manipulation functions are provided 
by static library header windows.h. As pywintypes is a wrapper around 
this API we will use it.
kernel32.dll contains FindFirstFile(), FindNextFile() and FindClose() as well and they can be used directly via ctypes.

The Unix version of this code is an adaptation of code provided by user
'jason-orendorff' on Stack Overflow answering a question by user 'adrien'.
The original URL is:
http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately

The Unix code is tested on Raspbian for now and it works. A reasonable 
conclusion is that it'll work on all Debian based distros as well.

NOTE: dirent structure is not the same on all distros, so the code will break on some of them.

The code is also tested on Cygwin using cygwin1.dll and it 
doesn't work.

If platform isn't Windows or Posix environment, listdir will be 
redirected back to os.listdir().

NOTE: There is scandir module implementing this code with no dependencies, excellent error handling and portability. I found it only after putting together this code. scandir() is now included in standardlib of Python 3.5 as os.scandir().
You definitely should use scandir, not this code.
Scandir module is available on pypi.python.org.
"""

import sys, os

__all__ = ["listdir"]

if sys.platform.startswith("win"):
    from win32file import FindFilesIterator

    def listdir (path):
        """
        A generator to return the names of files in the directory passed in
        """
        if "*" not in path and "?" not in path:
            st = os.stat(path) # Raise an error if dir doesn't exist or access is denied to us
            # Check if we got a dir or something else!
            # Check gotten from stat.py (for fast checking):
            if (st.st_mode & 0170000) != 0040000:
                e = OSError()
                e.errno = 20; e.filename = path; e.strerror = "Not a directory"
                raise e
            path = path.rstrip("\\/")+"\\*"
        # Else:  Decide that user knows what she/he is doing
        for file in FindFilesIterator(path):
            name = file[-2]
            # Unfortunately, only drives (eg. C:) don't include "." and ".." in the list:
            if name=="." or name=="..": continue
            yield name

elif os.name=="posix":
    if not sys.platform.startswith("linux"):
        print >> sys.stderr, "WARNING: Environment is Unix but platform is '"+sys.platform+"'\nlistdir() may not work properly."
    from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
    from ctypes.util import find_library

    class c_dir(Structure):
        """Opaque type for directory entries, corresponds to struct DIR"""
        pass

    c_dir_p = POINTER(c_dir)

    class c_dirent(Structure):
        """Directory entry"""
        # FIXME not sure these are the exactly correct types!
        _fields_ = (
            ('d_ino', c_long), # inode number
            ('d_off', c_long), # offset to the next dirent
            ('d_reclen', c_ushort), # length of this record
            ('d_type', c_byte), # type of file; not supported by all file system types
            ('d_name', c_char * 4096) # filename
            )

    c_dirent_p = POINTER(c_dirent)

    c_lib = CDLL(find_library("c"))
    # Extract functions:
    opendir = c_lib.opendir
    opendir.argtypes = [c_char_p]
    opendir.restype = c_dir_p

    readdir = c_lib.readdir
    readdir.argtypes = [c_dir_p]
    readdir.restype = c_dirent_p

    closedir = c_lib.closedir
    closedir.argtypes = [c_dir_p]
    closedir.restype = c_int

    def listdir(path):
        """
        A generator to return the names of files in the directory passed in
        """
        st = os.stat(path) # Raise an error if path doesn't exist or we don't have permission to access it
        # Check if we got a dir or something else!
        # Check gotten from stat.py (for fast checking):
        if (st.st_mode & 0170000) != 0040000:
            e = OSError()
            e.errno = 20; e.filename = path; e.strerror = "Not a directory"
            raise e
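        dir_p = opendir(path)
        if not dir_p: # opendir() returned NULL, e.g. no read permission on the directory
            raise OSError("opendir() failed for '%s'" % path)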
        try:
            while True:
                p = readdir(dir_p)
                if not p: break # End of directory
                name = p.contents.d_name
                if name!="." and name!="..": yield name
        finally: closedir(dir_p)

else:
    print >> sys.stderr, "WARNING: Platform is '"+sys.platform+"'!\nFalling back to os.listdir(), iterator generator will not be returned!"
    listdir = os.listdir

if __name__ == "__main__":
    print
    if len(sys.argv)!=1:
        try: limit = int(sys.argv[2])
        except (IndexError, ValueError): limit = -1 # No valid limit given, list everything
        count = 0
        for name in listdir(sys.argv[1]):
            if count==limit: break
            count += 1
            print repr(name),
        print "\nListed", count, "items from directory '%s'" % sys.argv[1]
    if len(sys.argv)!=1: sys.exit()
    from timeit import *
    print "Speed test:"
    dir = ("/etc", r"C:\WINDOWS\system32")[sys.platform.startswith("win")]
    t = Timer("l = listdir(%s)" % repr(dir), "from listdir import listdir")
    print "Measuring time required to create an iterator to list a directory:"
    time = t.timeit(200)
    print "Time required to return a generator for directory '"+dir+"' is", time, "seconds measured through 200 passes"
    t = Timer("l = os.listdir(%s)" % repr(dir), "import os")
    print "Measuring time required to create a list of directory in advance using os.listdir():"
    time = t.timeit(200)
    print "Time required to return a list for directory '"+dir+"' is", time, "seconds measured through 200 passes"
    t = Timer("l = []\nfor file in listdir(%s): l.append(file)" % repr(dir), "from listdir import listdir")
    print "Measuring time needed to create a list of directory using our listdir() instead of os.listdir():"
    time = t.timeit(200)
    print "Time required to create a list for directory '"+dir+"' using our listdir() instead of os.listdir() is", time, "seconds measured through 200 passes"