Pythonic style for filtering data in CSVs

Time: 2016-04-23 03:07:22

Tags: python string list csv set

I have a folder of 1000 csv files for a set of datasets. For each dataset I have three types of data. The naming convention is as follows:

(assume the name of the dataset is 'aaa')

  1. 'aaa_values.csv'
  2. 'aaa_info.csv'
  3. 'aaaps.csv'

Each dataset has a stem of a different length, e.g. 'aaa', 'ab3fz', 'gjt89', etc. Note that the third type of data has no underscore separating the type - it is just 'ps.csv'.

I want to filter this to get the set of dataset names. I have managed to do it with the following Python code, but it is clunky and not 'Pythonic'.

Does anyone have a suggestion for a more elegant way of doing this?

    import os
    x = []
    y = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
    for i in y:
        x.append(i.split('_'))
    h = []
    for i in x:
        for j in i:
            h.append(j)
    c = [l for l in h if l != 'values.csv']
    c = list(set([p for p in c if p != 'info.csv']))
    c = [t for t in c if t[-6:] != 'ps.csv']
    

4 answers:

Answer 0 (score: 1)

The first step towards more Pythonic code is to use names that are more descriptive than x, y, i, j, c and p. This already looks a bit better:

# paths = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv', 'bbbps.csv', 'ccc_info.csv']
res = set()
for path in paths:
    path_parts = path.split('_')
    last = path_parts[-1]
    if (last == 'values.csv' or last == 'info.csv'):
        res.add(path_parts[0])
    elif last[-6:] == 'ps.csv':
        res.add(last[:-6])
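Once the names are descriptive, the whole loop can also collapse into a single set comprehension; a sketch, reusing the same sample `paths` list (assumed in place of listing the real folder):

```python
# Sample file names assumed in place of os.listdir() on the real folder
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv', 'bbbps.csv', 'ccc_info.csv']

# Keep the part before '_' when there is one; otherwise strip the 'ps.csv' suffix
datasets = {p.split('_')[0] if '_' in p else p[:-len('ps.csv')] for p in paths}
print(sorted(datasets))  # ['aaa', 'bbb', 'ccc']
```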

Answer 1 (score: 0)

Use a set to remove duplicates, and use find() to check for and locate the '_' in the filename:

import os

path = "C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData"
suffixlength = len('ps.csv')

# use a set to remove duplicates
datasets = set()

for filename in os.listdir(path):

    # find() returns the index of the '_' or -1 if it isn't found.   
    i = filename.find('_')

    # chop the filename at the index of the '_' if found else just
    # before the 'ps.csv' suffix
    datasets.add(filename[:i] if i>0 else filename[:-suffixlength])
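One caveat: a stray file with no '_' that also does not end in 'ps.csv' (say a 'README.txt' left in the folder) would be chopped incorrectly by the line above. A defensive variant that simply skips such names (sample file list assumed, not the real folder):

```python
# Hypothetical sample list standing in for os.listdir(path)
filenames = ['aaa_values.csv', 'aaaps.csv', 'README.txt']

datasets = set()
for filename in filenames:
    i = filename.find('_')
    if i > 0:
        # 'aaa_values.csv' -> 'aaa'
        datasets.add(filename[:i])
    elif filename.endswith('ps.csv'):
        # 'aaaps.csv' -> 'aaa'
        datasets.add(filename[:-len('ps.csv')])
    # anything else (e.g. 'README.txt') is ignored

print(sorted(datasets))  # ['aaa']
```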

Answer 2 (score: 0)

Regular expressions and set comprehensions work very well here, so to extend Mike Muller's example:

import re
import os

# see https://regex101.com/r/sR5cQ8 for the regex explanation
dataset_re = re.compile(r'^(.*)(?:_values|_info|ps)\.csv$')

# paths = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv',
         'bbbps.csv', 'ccc_info.csv', 'README.txt']
dataset_universe = {match for path in paths
                          for match in dataset_re.findall(path)}

dataset_universe  # {'aaa', 'bbb', 'ccc'}

Regex explanation, copied from https://regex101.com/r/sR5cQ8:

^ assert position at start of the string
1st Capturing group (.*)
    .* matches any character (except newline)
        Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?:_values|_info|ps) Non-capturing group
    1st Alternative: _values
        _values matches the characters _values literally (case sensitive)
    2nd Alternative: _info
        _info matches the characters _info literally (case sensitive)
    3rd Alternative: ps
        ps matches the characters ps literally (case sensitive)
\. matches the character . literally
csv matches the characters csv literally (case sensitive)
$ assert position at end of the string
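As a quick sanity check of the pattern (file names assumed from the examples above): matching names yield the dataset stem, and anything else yields nothing:

```python
import re

# Same pattern as in the answer above
dataset_re = re.compile(r'^(.*)(?:_values|_info|ps)\.csv$')

# Matching names return the captured stem; non-matching names return an empty list
print(dataset_re.findall('aaa_values.csv'))  # ['aaa']
print(dataset_re.findall('bbbps.csv'))       # ['bbb']
print(dataset_re.findall('README.txt'))      # []
```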

Answer 3 (score: 0)

Here is another way to do this using pandas:

Consider that your input folder contains the following files:

aaa_info.csv
aaa_values.csv
aaapd.csv
bbb_info.csv
bbb_values.csv
bbbpd.csv
ccc_info.csv
ccc_values.csv
cccpd.csv
ddd_info.csv
ddd_values.csv
dddpd.csv

CODE

import os
import glob
import pandas as pd

#Get all the csv files in the folder
flist = [os.path.basename(x) for x in glob.glob(os.getcwd() + '\\*.csv')]

#Create a dataframe
df = pd.DataFrame()

#Create a column named files and assign the file list to it
df['files'] = flist

#Create another column named set and assign only the portion of the file name that we want to create set by
df['set'] = [x.rsplit('_')[0] if '_' in x else x[:-6] for x in flist]

#Group by the set names required
#if you want the output as sets
dfs = df.groupby(df['set']).apply(lambda x: set(x['files']))

#if you want them to be CSV
dfg = df.groupby(df['set']).apply(lambda x: ','.join(x['files']))

dfg.to_csv('setoffiles.csv')

The following would be the resulting output:

(screenshot of the grouped output, omitted here)
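For comparison, the same grouping can be sketched without pandas using a plain dict (the sample file names from the answer above are assumed):

```python
# Sample file names assumed, as in the pandas answer
flist = ['aaa_info.csv', 'aaa_values.csv', 'aaapd.csv',
         'bbb_info.csv', 'bbb_values.csv', 'bbbpd.csv']

groups = {}
for name in flist:
    # Same stem rule as the 'set' column: text before '_', else drop the 6-char suffix
    stem = name.rsplit('_')[0] if '_' in name else name[:-6]
    groups.setdefault(stem, []).append(name)

print(groups['aaa'])  # ['aaa_info.csv', 'aaa_values.csv', 'aaapd.csv']
```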