Pythonic style for filtering data in CSVs

Time: 2016-04-23 03:07:22

Tags: python string list csv set

I have a folder of 1000 csv files for a set of datasets. For each dataset I have three types of data. The naming convention is as follows:

(assume the name of the dataset is 'aaa')

  1. 'aaa_values.csv'
  2. 'aaa_info.csv'
  3. 'aaaps.csv'

Each dataset has a stem of a different length, e.g. 'aaa', 'ab3fz', 'gjt89', etc. Note that the third type of data has no underscore separating the type - it is just 'ps.csv'.

I want to filter this to get the set of dataset names. I have managed to do it with the following Python code, but it is clunky and not 'Pythonic'.

Does anyone have a suggestion for a more elegant way of doing this?

    import os
    x = []
    y = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
    for i in y:
        x.append(i.split('_'))
    h = []
    for i in x:
        for j in i:
            h.append(j)
    c = [l for l in h if l != 'values.csv']
    c = list(set([p for p in c if p != 'info.csv']))
    c = [t for t in c if t[-6:] != 'ps.csv']
    

4 answers:

Answer 0 (score: 1)

The first step towards more Pythonic code is to use names that are more descriptive than x, y, i, j, c and p. This already looks a bit better:

# paths = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv', 'bbbps.csv', 'ccc_info.csv']
res = set()
for path in paths:
    path_parts = path.split('_')
    last = path_parts[-1]
    if (last == 'values.csv' or last == 'info.csv'):
        res.add(path_parts[0])
    elif last[-6:] == 'ps.csv':
        res.add(last[:-6])
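Once the names are descriptive, the whole loop can also collapse into a single set comprehension; a sketch, reusing the same sample `paths` list (assumed in place of listing the real folder):

```python
# Sample file names assumed in place of os.listdir() on the real folder
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv', 'bbbps.csv', 'ccc_info.csv']

# Keep the part before '_' when there is one; otherwise strip the 'ps.csv' suffix
datasets = {p.split('_')[0] if '_' in p else p[:-len('ps.csv')] for p in paths}
print(sorted(datasets))  # ['aaa', 'bbb', 'ccc']
```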

Answer 1 (score: 0)

Use a set to remove duplicates, and use find() to check for and locate the '_' in the filename:

import os

path = "C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData"
suffixlength = len('ps.csv')

# use a set to remove duplicates
datasets = set()

for filename in os.listdir(path):

    # find() returns the index of the '_' or -1 if it isn't found.   
    i = filename.find('_')

    # chop the filename at the index of the '_' if found else just
    # before the 'ps.csv' suffix
    datasets.add(filename[:i] if i>0 else filename[:-suffixlength])
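One caveat: a stray file with no '_' that also does not end in 'ps.csv' (say a 'README.txt' left in the folder) would be chopped incorrectly by the line above. A defensive variant that simply skips such names (sample file list assumed, not the real folder):

```python
# Hypothetical sample list standing in for os.listdir(path)
filenames = ['aaa_values.csv', 'aaaps.csv', 'README.txt']

datasets = set()
for filename in filenames:
    i = filename.find('_')
    if i > 0:
        # 'aaa_values.csv' -> 'aaa'
        datasets.add(filename[:i])
    elif filename.endswith('ps.csv'):
        # 'aaaps.csv' -> 'aaa'
        datasets.add(filename[:-len('ps.csv')])
    # anything else (e.g. 'README.txt') is ignored

print(sorted(datasets))  # ['aaa']
```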

Answer 2 (score: 0)

Regular expressions and set comprehensions work very well here, so to extend Mike Muller's example:

import re
import os

# see https://regex101.com/r/sR5cQ8 for the regex explanation
dataset_re = re.compile(r'^(.*)(?:_values|_info|ps)\.csv$')

# paths = os.listdir("C:\\Anaconda3\\Python_Library\\Python_Folder\\csvData")
paths = ['aaa_values.csv', 'aaa_info.csv', 'aaaps.csv',
         'bbbps.csv', 'ccc_info.csv', 'README.txt']
dataset_universe = {match for path in paths
                          for match in dataset_re.findall(path)}

dataset_universe  # {'aaa', 'bbb', 'ccc'}

Regex explanation, copied from https://regex101.com/r/sR5cQ8:

^ assert position at start of the string
1st Capturing group (.*)
    .* matches any character (except newline)
        Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?:_values|_info|ps) Non-capturing group
    1st Alternative: _values
        _values matches the characters _values literally (case sensitive)
    2nd Alternative: _info
        _info matches the characters _info literally (case sensitive)
    3rd Alternative: ps
        ps matches the characters ps literally (case sensitive)
\. matches the character . literally
csv matches the characters csv literally (case sensitive)
$ assert position at end of the string
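As a quick sanity check of the pattern (file names assumed from the examples above): matching names yield the dataset stem, and anything else yields nothing:

```python
import re

# Same pattern as in the answer above
dataset_re = re.compile(r'^(.*)(?:_values|_info|ps)\.csv$')

# Matching names return the captured stem; non-matching names return an empty list
print(dataset_re.findall('aaa_values.csv'))  # ['aaa']
print(dataset_re.findall('bbbps.csv'))       # ['bbb']
print(dataset_re.findall('README.txt'))      # []
```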

Answer 3 (score: 0)

Here is another way to do this using pandas:

Consider that your input folder contains the following files:

aaa_info.csv
aaa_values.csv
aaapd.csv
bbb_info.csv
bbb_values.csv
bbbpd.csv
ccc_info.csv
ccc_values.csv
cccpd.csv
ddd_info.csv
ddd_values.csv
dddpd.csv

CODE

import os
import glob
import pandas as pd

#Get all the csv files in the folder
flist = [os.path.basename(x) for x in glob.glob(os.getcwd() + '\\*.csv')]

#Create a dataframe
df = pd.DataFrame()

#Create a column named files and assign the file list to it
df['files'] = flist

#Create another column named set and assign only the portion of the file name that we want to create set by
df['set'] = [x.rsplit('_')[0] if '_' in x else x[:-6] for x in flist]

#Group by the set names required
#if you want the output as sets
dfs = df.groupby(df['set']).apply(lambda x: set(x['files']))

#if you want them to be CSV
dfg = df.groupby(df['set']).apply(lambda x: ','.join(x['files']))

dfg.to_csv('setoffiles.csv')

The following would be the resulting output:

(screenshot of the grouped output, omitted here)
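For comparison, the same grouping can be sketched without pandas using a plain dict (the sample file names from the answer above are assumed):

```python
# Sample file names assumed, as in the pandas answer
flist = ['aaa_info.csv', 'aaa_values.csv', 'aaapd.csv',
         'bbb_info.csv', 'bbb_values.csv', 'bbbpd.csv']

groups = {}
for name in flist:
    # Same stem rule as the 'set' column: text before '_', else drop the 6-char suffix
    stem = name.rsplit('_')[0] if '_' in name else name[:-6]
    groups.setdefault(stem, []).append(name)

print(groups['aaa'])  # ['aaa_info.csv', 'aaa_values.csv', 'aaapd.csv']
```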