Looping to get file size, folder size, and directory size?

Time: 2020-05-01 17:16:46

Tags: python pandas dataframe

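I am trying to scan a directory along with all of its subfolders and files. I also want to get the size of every folder and every file. I am a bit confused about the best technique. This is what I have so far. The directory total in the output is wrong, and so is the total folder size.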

import os
import pandas as pd
import time
from pathlib import Path

# sets the display so that when the code prints, it is readable
pd.set_option('display.max_rows', 3000)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 3000)

# Initialize the dataframe
col_names = ['directory', 'file name', 'file size', 'file date', 'total in directory', 'total in folder']
files = pd.DataFrame(columns=col_names)

dir_path = Path('G:/OM/Permits')
for dirpath, dirnames, filenames in os.walk(dir_path):
    print(dirpath)
    files.loc[dirpath, 'directory'] = dirpath
    total_file = sum(os.path.getsize(f) for f in os.scandir(dirpath) if os.path.isfile(f))
    files.loc[total_file, 'total in directory'] = total_file
    for file_size in dirpath:
        total_file = round((sum(os.path.getsize(f) for f in os.scandir(dirpath) if os.path.isfile(f)) / 1048576), 3)
        files.loc[total_file, 'total in folder'] = total_file
    with os.scandir(dirpath) as i:
     for entry in i:
         if entry.is_file():
             print(entry.name)
             files.loc[entry.name, 'file name'] = entry.name
             file_size = round((os.path.getsize(entry) / 1048576),3)
             files.loc[file_size, 'file size'] = file_size
             files_date = time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(entry)))
             files.loc[files_date, 'file date'] = files_date

df = pd.DataFrame(files)
df['file size'] = df['file size'].shift(periods=-1)
df['file date'] = df['file date'].shift(periods=-2)
df.reset_index(drop=True, inplace=True)
df.dropna(how='all')
print(df)
#df.to_csv('G Drive List of Files.csv')

Here is part of my output.

                                             directory                                          file name file size   file date total in directory total in folder
0                                         G:\OM\Permits                                                NaN       NaN         NaN                NaN             NaN
1                                                   NaN                                                NaN       NaN         NaN            1394256             NaN
2                                                   NaN                                                NaN       NaN         NaN                NaN            1.33
3                                                   NaN                           3-Letter_PermitCodes.pdf     0.136  04/01/2019                NaN             NaN

1 answer:

Answer 0 (score: 3)

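You can try collecting all the information in a dict and then converting it into a dataframe.

  1. Use os.walk and collect the information for every file:

    • Add directory, file_name, file_size, and file_date, as you already do
  2. Convert data into a dataframe

  3. Group by directory and compute aggregation functions such as count and sum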
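Step 3 can be tried on a small hand-made table before running it against a real drive. The sketch below is only an illustration: the sample rows are hypothetical, and it uses pandas named aggregation plus `transform`, which is one possible way to get both a per-folder summary and a per-file "total in directory" column.

```python
import pandas as pd

# Hypothetical sample rows standing in for the result of the directory scan
files = pd.DataFrame({
    "directory": ["root", "root", "root/sub"],
    "file_name": ["a.pdf", "b.pdf", "c.pdf"],
    "file_size": [100, 250, 40],
})

# Named aggregation: each output column is a (source column, function) pair
folder_stats = (
    files.groupby("directory")
         .agg(total_files=("file_name", "count"),
              total_size=("file_size", "sum"))
         .reset_index()
)

# transform("sum") writes each directory's total back onto its file rows
files["total_in_directory"] = (
    files.groupby("directory")["file_size"].transform("sum")
)

print(folder_stats)
print(files)
```

The `transform` line is what puts the folder total on the same row as each file, which is the layout the question's 'total in directory' column was aiming for.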

Code

import os
import time
from pathlib import Path

import pandas as pd

dir_path = Path(r'G:/OM/Permits')

# Collect data for all files in the directory
data = {'directory': [], 'file_name': [], 'file_size': [], 'file_date': []}
for dirpath, dirnames, filenames in os.walk(dir_path):
    for f in filenames:
        filename = os.path.join(dirpath, f)  # full path, portable across operating systems
        data["directory"].append(dirpath)
        data["file_name"].append(f)
        data["file_size"].append(os.path.getsize(filename))
        data["file_date"].append(time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(filename))))

# Transform data in dataframe
files = pd.DataFrame(data)
print(files)

# details per folder:
folders_stats = files.groupby("directory").agg({"file_name": 'count',
                                                "file_size": "sum"}) \
                    .rename(columns={"file_name": "total_files", "file_size": "total_size"}) \
                    .reset_index()
print(folders_stats)
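Note that the groupby above sums only the files that sit directly inside each directory. If the "folder size" you want should also include everything in its subfolders, one possible approach is to walk the tree bottom-up and roll the totals upward. This is a minimal sketch, not part of the original answer: `folder_sizes` is a hypothetical helper name, sizes are in bytes, and it assumes every file is readable and that symbolic links to directories should be ignored.

```python
import os

import pandas as pd


def folder_sizes(root):
    """Return one row per folder with its direct and cumulative size in bytes."""
    direct = {}   # bytes of the files directly inside each folder
    totals = {}   # direct bytes plus the totals of all subfolders
    # topdown=False visits subfolders before their parents,
    # so every child's total is known when the parent is processed
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        own = sum(os.path.getsize(os.path.join(dirpath, f)) for f in filenames)
        # Subfolders that were not walked (e.g. symlinks) count as 0
        nested = sum(totals.get(os.path.join(dirpath, d), 0) for d in dirnames)
        direct[dirpath] = own
        totals[dirpath] = own + nested
    return pd.DataFrame({
        "directory": list(totals),
        "direct_size": [direct[d] for d in totals],
        "total_size": list(totals.values()),
    })
```

Calling `folder_sizes(dir_path)` would then give a table that can be merged with `files` on the `directory` column.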