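I have a folder containing several hundred files. Judging by their names, the files are not all independent, for example: name1_01.csv, name1_02.csv, ..., name1_10.csv, name2_01.csv, name2_02.csv, and so on.
So there are several root names: "name1", "name2", "name3", etc. I need to iterate over the files that share a root so that I can merge their contents (for example, merge all the "name1" files), remove the duplicate rows, then move on to the next root, "name2", and so on.
I'm not sure how to do this other than with several nested for loops. Is there a better way?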
Answer 0 (score: 0)
This worked for me:
import os

path = os.path.normpath('C:\\Users\\mateo\\Documents\\files')
path_list = os.listdir(path)
main_files = {}  # root name -> set of lines already written to the merged file
for file in path_list:
    name = file.split('_')[0]  # root name, e.g. 'name1' from 'name1_01.csv'
    if name in main_files:
        # Root seen before: append only the lines not written yet.
        with open(os.path.join(path, name + '.txt'), 'a+') as f:
            with open(os.path.join(path, file), 'r') as f1:
                for line in f1:
                    if line not in main_files[name]:
                        f.write(line)
                        main_files[name].add(line)
    else:
        # First file for this root: create (or overwrite) the merged file.
        main_files[name] = set()
        with open(os.path.join(path, file), 'r') as f:
            with open(os.path.join(path, name + '.txt'), 'w+') as f1:
                for line in f:
                    if line not in main_files[name]:
                        f1.write(line)
                        main_files[name].add(line)
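This gives you one merged file per root name, with the duplicate lines removed. Note that for this to work the folder must contain only the files of interest, and those files must follow the naming convention "nameN_something.extension". For your case, change the extension from .txt to .csv.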
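As a variation on the same idea, here is a minimal sketch that first groups the files by root and then writes each merged file into a separate output folder, so the source folder no longer has to contain only the files of interest. The function name `merge_by_root` and its parameters `src_dir`, `out_dir`, and `ext` are hypothetical, not part of the answer above. It also assumes that comparing whole lines is enough to detect duplicates and that a repeated CSV header row should simply be dropped as a duplicate.

```python
import os
from collections import defaultdict


def merge_by_root(src_dir, out_dir, ext=".csv"):
    """Merge files that share a root name; keep the first copy of each line."""
    groups = defaultdict(list)  # root name -> list of file paths
    for entry in sorted(os.listdir(src_dir)):
        full = os.path.join(src_dir, entry)
        # Skip subfolders, other extensions, and names without an underscore.
        if not os.path.isfile(full) or not entry.endswith(ext) or "_" not in entry:
            continue
        groups[entry.split("_")[0]].append(full)

    os.makedirs(out_dir, exist_ok=True)
    merged = {}  # root name -> path of the merged file
    for root, files in groups.items():
        seen = set()
        out_path = os.path.join(out_dir, root + ext)
        with open(out_path, "w") as out:
            for path in files:
                with open(path, "r") as src:
                    for line in src:
                        # Normalize the line ending so that a final line
                        # without a newline does not run into the next file.
                        line = line.rstrip("\n")
                        if line not in seen:
                            seen.add(line)
                            out.write(line + "\n")
        merged[root] = out_path
    return merged
```

Calling `merge_by_root(path, some_other_folder)` would return a dict mapping each root to its merged file. Keeping `out_dir` outside `src_dir` means a second run does not pick up the merged files as input.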
Answer 1 (score: 0)
from pathlib import Path
from itertools import groupby
import re

def groupby_fn(filename):
    m = re.match(r'[^_]+_', filename)  # match everything up to and including the first '_'
    return m[0]

# '*_*.csv' matches any .csv name containing an underscore; sorting keeps
# files with the same prefix adjacent, which groupby() requires
files = sorted(str(f) for f in Path('.').glob('*_*.csv') if f.is_file())
groups = []
for k, g in groupby(files, groupby_fn):
    groups.append(list(g))  # these are your groupings
for group in groups:
    print(group)  # a list of files with the same prefix, such as 'name1_'
In my directory I have these files:
04/10/2019 06:03 PM 28,744 donors1.csv
04/12/2019 12:02 PM 9,821 donors10.csv
04/12/2019 12:15 PM 14,019 donors3.csv
04/12/2019 12:01 PM 15,581 donors5.csv
So I modified the code above:
from pathlib import Path
from itertools import groupby
import re

def groupby_fn(filename):
    m = re.match(r'donors\d', filename)  # match 'donors' plus the first digit only
    return m[0]

files = sorted(str(f) for f in Path('.').glob('donors*.csv') if f.is_file())  # or glob('*') if you want all files
groups = []
for k, g in groupby(files, groupby_fn):
    groups.append(list(g))  # these are your groupings
for group in groups:
    print(group)  # a list of files with the same prefix, such as 'donors1'
This prints:
['donors1.csv', 'donors10.csv']
['donors3.csv']
['donors5.csv']
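Note that `donors\d` matches only the first digit, which is why donors1.csv and donors10.csv end up in the same group. If every number should count as its own root, one option is to match all the digits with `\d+`. The sketch below wraps the grouping in a function and sorts by the same key that `groupby` uses, so equal keys are guaranteed to be adjacent. The name `group_by_root` and the default pattern `[^_\d]+\d+` are assumptions for illustration, not part of the answer.

```python
import re
from itertools import groupby


def group_by_root(filenames, pattern=r"[^_\d]+\d+"):
    """Group file names by the prefix that `pattern` matches at the start."""
    def key(name):
        m = re.match(pattern, name)
        # A name that does not match becomes its own group instead of raising.
        return m[0] if m else name

    # groupby() only merges adjacent items, so sort by the grouping key first.
    ordered = sorted(filenames, key=key)
    return {k: list(g) for k, g in groupby(ordered, key)}


if __name__ == "__main__":
    names = ["donors1.csv", "donors10.csv", "donors3.csv", "donors5.csv"]
    for root, members in group_by_root(names).items():
        print(root, members)
```

Passing `pattern=r"donors\d"` should reproduce the grouping printed above, with donors1.csv and donors10.csv together.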