I have a folder of 1000 csv files covering a number of datasets. For each dataset I have three types of data. The naming convention is as follows (assume a dataset named 'aaa'):
'aaa_values.csv'
'aaa_info.csv'
'aaaps.csv'
Each dataset has a title of varying length, e.g. 'aaa', 'ab3fz', 'gjt89', etc. Note that the third type of data has no underscore separating the type - it is just 'ps.csv'.
I want to filter this to get the set of dataset names. I have managed to do it with the following Python code, but it is clunky and not 'Pythonic'.
Does anyone have a suggestion for a more elegant way of doing this?
import os
x = []
y = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
for i in y:
    x.append(i.split('_'))
h = []
for i in x:
    for j in i:
        h.append(j)
c = [l for l in h if l != 'values.csv']
c = list(set([p for p in c if p != 'info.csv']))
[t for t in c if t[-6:]!='ps.csv']
Answer 0 (score: 1)
A first step towards more Pythonic code is to use names that are more descriptive than x, y, i, j, c and p. This looks a bit better now:
# paths = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv', 'bbbps.csv', 'ccc_info.csv']
res = set()
for path in paths:
    path_parts = path.split('_')
    last = path_parts[-1]
    if last == 'values.csv' or last == 'info.csv':
        res.add(path_parts[0])
    elif last[-6:] == 'ps.csv':
        res.add(last[:-6])
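For the sample paths above, this yields (a quick check, not part of the original answer):

print(sorted(res))  # ['aaa', 'bbb', 'ccc']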
Answer 1 (score: 0)
Use a set to remove duplicates, and use find() to check for and locate the '_' in the filename:
import os
path = "C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData"
suffixlength = len('ps.csv')
# use a set to remove duplicates
datasets = set()
for filename in os.listdir(path):
    # find() returns the index of the '_' or -1 if it isn't found.
    i = filename.find('_')
    # chop the filename at the index of the '_' if found, else just
    # before the 'ps.csv' suffix
    datasets.add(filename[:i] if i > 0 else filename[:-suffixlength])
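A quick check of the same logic against the question's example names (added here, not part of the original answer):

datasets = set()
for filename in ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv', 'ab3fz_info.csv']:
    i = filename.find('_')
    datasets.add(filename[:i] if i > 0 else filename[:-len('ps.csv')])
print(datasets)  # {'aaa', 'ab3fz'} (set order may vary)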
Answer 2 (score: 0)
Regular expressions and set comprehensions are terrific, so extending Mike Muller's example:
import re
import os
# see https://regex101.com/r/sR5cQ8 for the regex explanation
dataset_re = re.compile(r'^(.*)(?:_values|_info|ps)\.csv$')
# paths = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv',
         'bbbps.csv', 'ccc_info.csv', 'README.txt']
dataset_universe = {match for path in paths
                    for match in dataset_re.findall(path)}
dataset_universe # {'aaa', 'bbb', 'ccc'}
Regex explanation copied from https://regex101.com/r/sR5cQ8:
^ assert position at start of the string
1st Capturing group (.*)
.* matches any character (except newline)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?:_values|_info|ps) Non-capturing group
1st Alternative: _values
_values matches the characters _values literally (case sensitive)
2nd Alternative: _info
_info matches the characters _info literally (case sensitive)
3rd Alternative: ps
ps matches the characters ps literally (case sensitive)
\. matches the character . literally
csv matches the characters csv literally (case sensitive)
$ assert position at end of the string
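To see how the capturing group behaves on individual names (an illustration added here, not part of the regex101 output):

dataset_re.findall('aaa_values.csv')  # ['aaa']
dataset_re.findall('bbbps.csv')       # ['bbb']
dataset_re.findall('README.txt')      # [] - non-matching files add nothing to the set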
Answer 3 (score: 0)
Here is another way to do this using pandas:
Consider that your input folder contains the following files:
aaa_info.csv
aaa_values.csv
aaaps.csv
bbb_info.csv
bbb_values.csv
bbbps.csv
ccc_info.csv
ccc_values.csv
cccps.csv
ddd_info.csv
ddd_values.csv
dddps.csv
CODE
import os
import glob
import pandas as pd

# Get all the csv files in the folder
flist = [os.path.basename(x) for x in glob.glob(os.path.join(os.getcwd(), '*.csv'))]

# Create a dataframe
df = pd.DataFrame()

# Create a column named 'files' and assign the file list to it
df['files'] = flist

# Create another column named 'set' holding only the portion of the
# file name that identifies the dataset
df['set'] = [x.rsplit('_')[0] if '_' in x else x[:-6] for x in flist]

# Group by the set names
# if you want the output as sets
dfs = df.groupby(df['set']).apply(lambda x: set(x['files']))

# if you want them as CSV
dfg = df.groupby(df['set']).apply(lambda x: ','.join(x['files']))
dfg.to_csv('setoffiles.csv')
The following will be the resulting output:
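Reconstructed for the twelve files above (row order depends on how glob lists the files), dfg should look roughly like:

set
aaa    aaa_info.csv,aaa_values.csv,aaaps.csv
bbb    bbb_info.csv,bbb_values.csv,bbbps.csv
ccc    ccc_info.csv,ccc_values.csv,cccps.csv
ddd    ddd_info.csv,ddd_values.csv,dddps.csv
dtype: object

with setoffiles.csv containing the same rows in CSV form.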