Question

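I have 10,000 folders (test1, test2, ..., test10000). Each folder contains the same five .csv files (real.csv, model.csv, run.csv, swe.csv, error.csv), but the values in each file are different.

I need to merge all the CSVs that share a file name into a single CSV, i.e. concatenate the data from all 10,000 real.csv files into one .csv (e.g. real.csv). The concatenated file also has to be ordered: row 1 must come from test1, row 2 from test2, ..., and row 10,000 from test10000.

I used the code from the following question as a blueprint and edited it to add sorting: Merge multiple csv files with same name in 10 different subdirectory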

import pandas as pd
import glob

concat_dir = '/home/zachary/workspace/lineartransfer/error/files_concat/'

files = pd.DataFrame([file for file in glob.glob("/home/zachary/workspace/lineartransfer/error/*/*")], columns=["fullpath"])

# Split the full path into directory and filename
files_split = files['fullpath'].str.rsplit("/", n=1, expand=True).rename(columns={0: 'path', 1: 'filename'})

# Join these into one DataFrame
files = files.join(files_split)

# Iterate over unique filenames; read CSVs, concat DFs, save file
for f in files['filename'].unique():
    paths = files[files['filename'] == f]['fullpath'] # Get list of fullpaths from unique filenames
    dfs = [pd.read_csv(path, header=None) for path in sorted(paths)] # Get list of dataframes from CSV file paths
    concat_df = pd.concat(dfs) # Concat dataframes into one
    concat_df.to_csv(concat_dir + f) # Save dataframe

The code above works, but the rows come out in this order: 1 10 100 1000 10000 1001 1002 ... 102 1020 1021 ...

The order I need is: 1 2 3 ... 10000

Thanks.

Answer 1

This is a natural (alphanumeric) sort: it takes numbers and their values into account even when they are embedded in a string.

import re
from functools import cmp_to_key

def nasort(x, y):
    # Replace every digit run with a placeholder that zero-pads to 99 digits
    fx = re.sub(r'(\d+)', r'{:099d}', x)
    fy = re.sub(r'(\d+)', r'{:099d}', y)
    # Extract the digit runs, in order, as integers
    ax = map(int, re.sub(r'([^\d]+)', r' ', x).strip().split(' '))
    ay = map(int, re.sub(r'([^\d]+)', r' ', y).strip().split(' '))
    # Fill the placeholders so plain string comparison respects numeric value
    _x = fx.format(*ax)
    _y = fy.format(*ay)
    if   str(_x) > str(_y): return 1
    elif str(_x) < str(_y): return -1
    else: return 0

print(sorted(['file5', 'file2', 'file4', 'file1', 'file10']))
print(sorted(['file5', 'file2', 'file4', 'file1', 'file10'], key=cmp_to_key(nasort)))

The first printed list is the standard sort.
The second printed list is the new sort, in which file10 comes after file5:
['file1', 'file10', 'file2', 'file4', 'file5']
['file1', 'file2', 'file4', 'file5', 'file10']
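
As a possible alternative to the comparator, here is a minimal sketch of the same idea written as a key function, which avoids cmp_to_key and the fixed 99-digit padding. The name natural_key and the /data/... paths are made up for illustration; the paths are only shaped like the ones in the question. It assumes every string being compared has the same layout (text, number, text), as the testN folder paths do.

```python
import re

def natural_key(text):
    # re.split with a capturing group keeps the digit runs, so the result
    # alternates: non-digit text at even positions, digit runs at odd ones.
    parts = re.split(r'(\d+)', text)
    # Convert the digit runs to int so they compare by numeric value.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]

# Hypothetical paths, shaped like those produced by glob in the question.
paths = [
    '/data/test10/real.csv',
    '/data/test2/real.csv',
    '/data/test1/real.csv',
    '/data/test100/real.csv',
]

ordered = sorted(paths, key=natural_key)
print(ordered)
```

In the question's loop, this would presumably slot in as sorted(paths, key=natural_key) in place of sorted(paths), so that the data frames are read and concatenated in test1, test2, ..., test10000 order.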
