Folder name as one of the column names

Time: 2016-05-17 11:08:42

Tags: python file pandas directory dataset

I have thousands of files spread across more than a hundred folders. I need to write the name of one of the enclosing folders into each file as one of its columns.

Directory Structure:

Data -> 000 -> Trajectory -> set of files
Data -> 001 -> Trajectory -> set of files
Data -> 002 -> Trajectory -> set of files
Data -> 003 -> Trajectory -> set of files
.        .        .
.        .        .
.        .        .
Data -> nnn -> Trajectory -> set of files

Every Trajectory folder contains more than a hundred files, each with the extension .plt, and every file has the following columns:

39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04
39.984683,116.31845,0,492,39744.1202546296,2008-10-23,02:53:10
39.984686,116.318417,0,492,39744.1203125,2008-10-23,02:53:15
39.984688,116.318385,0,492,39744.1203703704,2008-10-23,02:53:20
39.984655,116.318263,0,492,39744.1204282407,2008-10-23,02:53:25
39.984611,116.318026,0,493,39744.1204861111,2008-10-23,02:53:30

What I am trying to do is put the folder name into each file as one of the columns.

Expected output for the files in the folder named 000:

000 39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04
000 39.984683,116.31845,0,492,39744.1202546296,2008-10-23,02:53:10
000 39.984686,116.318417,0,492,39744.1203125,2008-10-23,02:53:15
000 39.984688,116.318385,0,492,39744.1203703704,2008-10-23,02:53:20
000 39.984655,116.318263,0,492,39744.1204282407,2008-10-23,02:53:25
000 39.984611,116.318026,0,493,39744.1204861111,2008-10-23,02:53:30

I could not find any similar example to work from. Any suggestion would be helpful.

Edit 1: @EdChum suggested using glob, but that only lets me find files with a given extension. My problem here is something else.

In simpler words:

rootdir -> subdir_1 -> subdir_2 -> files

Include the name of subdir_1 as col[0] in all the files present in subdir_2, along with the other columns. The existing files can be modified directly; there is no need to create a new output file.

1 Answer:

Answer 0 (score: 1)

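  • The first code block collects all files ending in .plt.
  • Next, we check that your subdir_1 really consists only of digits and is 3 characters long (just a sanity check to make sure we do not touch every file that happens to end in .plt), and that the file is located in a Trajectory folder.
  • Finally, we open a new file with the same name as the original but with .new appended, read each line of the old file, add a new column containing the directory name at the beginning, and write the new line to the output file.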


import os

#get all plt files
traj_files = []
for root, dirs, files in os.walk('Data'):
    for filename in files:
        if filename.endswith('.plt'):
            traj_files.append(os.path.join(root, filename))

for traj_file in traj_files:

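    #the new column we want to write: the name of subdir_1 (e.g. '000')
    #split on os.sep so this also works where the separator is not '/'
    parts = os.path.normpath(traj_file).split(os.sep)
    new_col = parts[1]
    #check if filename looks OK
    if len(new_col) != 3 or not new_col.isnumeric() or 'Trajectory' not in parts:
        continue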

    #read old file and write new column
    with open(traj_file + '.new', 'w') as new_traj:
        with open(traj_file, 'r') as old_traj:
            for line in old_traj:
                new_traj.write(new_col + ' ' + line)
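
The question mentions that the existing files may be changed rather than copied. A minimal sketch of that follow-up step, assuming the loop above has already produced the .new files: the helper name promote_new_files is hypothetical and not part of the original answer. It moves each .new file over its original with os.replace, which overwrites the destination, so it is probably worth trying on a copy of Data first.

```python
import os

def promote_new_files(root_dir):
    """Replace each original .plt file with its '.plt.new' counterpart.

    Hypothetical helper: assumes the '.new' files were written next to
    the originals, as in the loop above.
    """
    replaced = []
    for root, dirs, files in os.walk(root_dir):
        for filename in files:
            if filename.endswith('.plt.new'):
                new_path = os.path.join(root, filename)
                # Strip the trailing '.new' to recover the original path
                old_path = new_path[:-len('.new')]
                # os.replace overwrites old_path if it already exists
                os.replace(new_path, old_path)
                replaced.append(old_path)
    return replaced
```

Calling promote_new_files('Data') after the main loop would leave only the .plt files, now carrying the extra column.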

There are certainly more flexible and elegant ways to do this, but it should work for your specific directory structure.
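
As one possible example of a more compact variant, here is a sketch using pathlib. The function name add_folder_column is hypothetical. The glob pattern '*/Trajectory/*.plt' stands in for the sanity checks, on the assumption that every folder directly under the data directory is one of the numbered folders; it does not verify that the name is three digits long.

```python
from pathlib import Path

def add_folder_column(data_dir):
    """Write '<file>.new' copies with the subdir_1 name prepended to each line.

    Hypothetical helper: assumes the layout data_dir/<subdir_1>/Trajectory/*.plt.
    """
    written = []
    # Only files exactly two levels below data_dir, inside a Trajectory folder
    for plt_path in sorted(Path(data_dir).glob('*/Trajectory/*.plt')):
        # parent is the Trajectory folder, parent.parent is subdir_1
        folder = plt_path.parent.parent.name
        out_path = plt_path.with_name(plt_path.name + '.new')
        with plt_path.open('r') as src, out_path.open('w') as dst:
            for line in src:
                dst.write(folder + ' ' + line)
        written.append(out_path)
    return written
```

For the layout in the question this would be called as add_folder_column('Data'). It returns the list of files it wrote, which makes it easy to check the result before touching the originals.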