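I have a bunch of .csv files in one folder that are all similar (same number of rows and columns). I want to read them all into a single data frame, keeping only one specific column from each file (the "Total" column), preferably with a header that identifies the source file.
What I want is something like this:
table_c
This is what I have so far: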
import pandas as pd
import glob
path = r'C:\Users\lsminervino\Desktop\MUN'
files = glob.glob(path + "/*.csv")
all_files = pd.concat([pd.read_csv(f , encoding="latin", sep=';', thousands='.', decimal=',') for f in files],axis =1, sort=False)
all_files.head()
OUT:
Unnamed: 0 Total Cadastro Sem Registro Civil \
0 3500105 - Adamantina 0.0 0.0 0.0
1 3500204 - Adolfo 0.0 0.0 0.0
2 3500303 - Aguaí 0.0 0.0 0.0
3 3500402 - Águas da Prata 0.0 0.0 0.0
4 3500501 - Águas de Lindóia 0.0 0.0 0.0
Unnamed: 0 Total Cadastro Sem Registro Civil \
0 3500105 - Adamantina 3.0 3.0 0.0
1 3500204 - Adolfo 0.0 0.0 0.0
2 3500303 - Aguaí 3.0 3.0 0.0
3 3500402 - Águas da Prata 0.0 0.0 0.0
4 3500501 - Águas de Lindóia 0.0 0.0 0.0
Unnamed: 0 Total ... Sem registro civil \
0 3500105 - Adamantina 0.0 ... 0.0
1 3500204 - Adolfo 0.0 ... 0.0
2 3500303 - Aguaí 0.0 ... 0.0
3 3500402 - Águas da Prata 0.0 ... 0.0
4 3500501 - Águas de Lindóia 0.0 ... 0.0
Unnamed: 0 Total Cadastro Sem Registro Civil \
0 3500105 - Adamantina 0.0 0.0 0.0
1 3500204 - Adolfo 0.0 0.0 0.0
2 3500303 - Aguaí 0.0 0.0 0.0
3 3500402 - Águas da Prata 0.0 0.0 0.0
4 3500501 - Águas de Lindóia 0.0 0.0 0.0
Unnamed: 0 Total Cadastro Sem Registro Civil Unnamed: 4
0 3500105 - Adamantina 0.0 0.0 0.0 NaN
1 3500204 - Adolfo 0.0 0.0 0.0 NaN
2 3500303 - Aguaí 0.0 0.0 0.0 NaN
3 3500402 - Águas da Prata 0.0 0.0 0.0 NaN
4 3500501 - Águas de Lindóia 0.0 0.0 0.0 NaN
[5 rows x 61 columns]
Answer 0 (score: 1)
all_files = pd.concat([pd.read_csv(f , encoding="latin", sep=';', thousands='.', decimal=',', usecols=['Total']).rename(columns={'Total':'Total_{}'.format(f.rpartition('\\')[2])}) for f in files], axis=1, sort=False)
Edit: Windows paths => changed the separator in rpartition to a backslash.
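As a hedged variant of the one-liner above, the file name can be taken from pathlib instead of splitting on a backslash, so the same code works with either path separator. This is only a sketch: the helper name read_total_columns and its parameters are hypothetical, and it assumes every file contains the requested column.

```python
from pathlib import Path

import pandas as pd


def read_total_columns(folder, column="Total", **read_csv_kwargs):
    """Read one column from every CSV in `folder` and place them side by side.

    Hypothetical helper: each output column is named "<column>_<file stem>".
    """
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        # usecols avoids parsing the columns that are not needed
        series = pd.read_csv(csv_path, usecols=[column], **read_csv_kwargs)[column]
        # Path.stem is the file name without directory and extension
        frames.append(series.rename(f"{column}_{csv_path.stem}"))
    # axis=1 puts one column per file next to each other
    return pd.concat(frames, axis=1)
```

With the files from the question, it could be called as read_total_columns(path, encoding="latin", sep=';', thousands='.', decimal=',').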
Answer 1 (score: 0)
I think it all boils down to having a function that reads one file and appends the file number to Total. Here is a complete example:
import os
import pandas as pd
import numpy as np
import glob
# Create dummy files
fldr = "data_test"
os.makedirs(fldr, exist_ok=True)
n_files = 10
N = 10
for i in range(n_files):
    df = pd.DataFrame(np.random.randn(N,4),
                      columns=["Total", "a", "b", "c"])\
           .sample(int(N*0.8))\
           .to_csv(f"{fldr}/file_{i+1}.csv")
# list of files
files = sorted(glob.glob(f"{fldr}/*.csv"))
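You want to use a function that extracts the file number n from the file name and renames the Total column using n.
def customRead(fn):
    # File number n, e.g. "data_test/file_3.csv" -> "3"
    n = fn.split("/")[-1].split(".")[0].split("_")[-1]
    # Keep only the saved index column and Total, then tag Total with n
    d = pd.read_csv(fn)[["Unnamed: 0", "Total"]]\
        .set_index("Unnamed: 0")\
        .rename(columns={"Total":f"Total_csv{n}"})
    return d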
Then you can concatenate the resulting data frames via
df = [customRead(fn) for fn in files]
df = pd.concat(df, axis=1)
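Because each dummy file keeps only a sample of the rows and customRead sets the saved index as the index, the concatenation above aligns rows on that index rather than on position. A minimal sketch of that behaviour, using two small hand-made frames (the values are made up for illustration):

```python
import pandas as pd

# Two frames that share only the index label 2
a = pd.DataFrame({"Total_csv1": [1.0, 2.0]}, index=[0, 2])
b = pd.DataFrame({"Total_csv2": [3.0, 4.0]}, index=[1, 2])

# Default join="outer": union of the index labels, NaN where a file lacks a row
combined = pd.concat([a, b], axis=1).sort_index()

# join="inner": only the labels present in every frame
common = pd.concat([a, b], axis=1, join="inner")
```

Here combined has one row per label seen in any file, while common keeps only the rows present in all of them; which one is appropriate depends on how missing rows should be treated.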
If you have many large files, you could consider using dask, as follows:
import dask
df = [dask.delayed(customRead)(fn) for fn in files]
df = pd.concat(dask.compute(df, scheduler='processes')[0],
               axis=1)