How to read a group of .csv files, keeping only one specified column from each file

Time: 2019-09-05 20:45:53

Tags: python pandas dataframe

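I have a bunch of .csv files in one folder, all with the same layout (same number of rows and columns). I want to read all of them into a single DataFrame, keeping only one specific column from each file (the "Total" column), preferably with a header that identifies the source file. What I want is something like this:

[image: table_c]

This is what I have so far: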

import pandas as pd
import glob
path = r'C:\Users\lsminervino\Desktop\MUN'
files = glob.glob(path + "/*.csv")
all_files = pd.concat([pd.read_csv(f , encoding="latin", sep=';', thousands='.', decimal=',') for f in files],axis =1, sort=False)
all_files.head()

OUT:

                  Unnamed: 0 Total  Cadastro  Sem Registro Civil  \
0        3500105 - Adamantina   0.0       0.0                 0.0   
1            3500204 - Adolfo   0.0       0.0                 0.0   
2             3500303 - Aguaí   0.0       0.0                 0.0   
3    3500402 - Águas da Prata   0.0       0.0                 0.0   
4  3500501 - Águas de Lindóia   0.0       0.0                 0.0   

                   Unnamed: 0 Total  Cadastro  Sem Registro Civil  \
0        3500105 - Adamantina   3.0       3.0                 0.0   
1            3500204 - Adolfo   0.0       0.0                 0.0   
2             3500303 - Aguaí   3.0       3.0                 0.0   
3    3500402 - Águas da Prata   0.0       0.0                 0.0   
4  3500501 - Águas de Lindóia   0.0       0.0                 0.0   

                   Unnamed: 0 Total     ...      Sem registro civil  \
0        3500105 - Adamantina   0.0     ...                     0.0   
1            3500204 - Adolfo   0.0     ...                     0.0   
2             3500303 - Aguaí   0.0     ...                     0.0   
3    3500402 - Águas da Prata   0.0     ...                     0.0   
4  3500501 - Águas de Lindóia   0.0     ...                     0.0   

                   Unnamed: 0 Total  Cadastro  Sem Registro Civil  \
0        3500105 - Adamantina   0.0       0.0                 0.0   
1            3500204 - Adolfo   0.0       0.0                 0.0   
2             3500303 - Aguaí   0.0       0.0                 0.0   
3    3500402 - Águas da Prata   0.0       0.0                 0.0   
4  3500501 - Águas de Lindóia   0.0       0.0                 0.0   

                   Unnamed: 0 Total  Cadastro  Sem Registro Civil  Unnamed: 4  
0        3500105 - Adamantina   0.0       0.0                 0.0         NaN  
1            3500204 - Adolfo   0.0       0.0                 0.0         NaN  
2             3500303 - Aguaí   0.0       0.0                 0.0         NaN  
3    3500402 - Águas da Prata   0.0       0.0                 0.0         NaN  
4  3500501 - Águas de Lindóia   0.0       0.0                 0.0         NaN  

[5 rows x 61 columns]   

2 answers:

Answer 0 (score: 1)

all_files = pd.concat(
    [pd.read_csv(f, encoding="latin", sep=';', thousands='.', decimal=',', usecols=['Total'])
       .rename(columns={'Total': 'Total_{}'.format(f.rpartition('\\')[2])})
     for f in files],
    sort=False)

Edit: Windows paths => changed to a backslash in rpartition.
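
The one-liner above splits the path on a literal backslash, so it only works with Windows-style paths, and without `axis=1` `pd.concat` stacks the frames vertically rather than side by side. A minimal sketch of a more portable variant, assuming `os.path.basename` is an acceptable way to get the file name; the temporary folder and its two sample files are made up for illustration, and the `encoding`, `thousands` and `decimal` options from the question are left out because the sample data does not need them:

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the real folder of CSV files.
folder = tempfile.mkdtemp()
for name, totals in [("a.csv", [1, 2]), ("b.csv", [3, 4])]:
    pd.DataFrame({"Total": totals, "Cadastro": [0, 0]}).to_csv(
        os.path.join(folder, name), sep=";", index=False)

files = sorted(glob.glob(os.path.join(folder, "*.csv")))

all_files = pd.concat(
    [pd.read_csv(f, sep=";", usecols=["Total"])
       # os.path.basename uses the path separator of the current OS
       .rename(columns={"Total": "Total_{}".format(os.path.basename(f))})
     for f in files],
    axis=1)  # axis=1 places the renamed columns side by side

print(all_files)
```

Each resulting column is named after its source file, for example `Total_a.csv`.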

Answer 1 (score: 0)

I think it all boils down to a function that appends the file number to Total. Here is a complete example:

import os
import pandas as pd
import numpy as np
import glob

# Create dummy files
fldr = "data_test"
os.makedirs(fldr, exist_ok=True)
n_files = 10
N = 10
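for i in range(n_files):
    # Keep a random 80% of the rows so the files do not share the same index.
    # to_csv returns None when given a path, so the result is not assigned.
    pd.DataFrame(np.random.randn(N,4),
                 columns=["Total", "a", "b", "c"])\
      .sample(int(N*0.8))\
      .to_csv(f"{fldr}/file_{i+1}.csv")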

# list of files
files = sorted(glob.glob(f"{fldr}/*.csv"))

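You want a function that does the following:

  1. Extract the file number n from the file name.
  2. Read the csv.
  3. Set the index.
  4. Rename the Total column using n from point 1.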
def customRead(fn):
    n = fn.split("/")[-1].split(".")[0].split("_")[-1]
    d = pd.read_csv(fn)[["Unnamed: 0", "Total"]]\
          .set_index("Unnamed: 0")\
          .rename(columns={"Total":f"Total_csv{n}"})
    return d
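
The file-number step in `customRead` splits on "/", which assumes forward-slash paths. A minimal sketch of a separator-agnostic alternative, assuming file names follow the `file_<n>.csv` pattern used in the dummy data; `file_number` is a hypothetical helper name, not part of the answer:

```python
from pathlib import PureWindowsPath


def file_number(fn):
    # PureWindowsPath accepts both "/" and "\" as path separators,
    # so the same code handles paths produced on either platform.
    stem = PureWindowsPath(fn).stem       # e.g. "file_3"
    return stem.rsplit("_", 1)[-1]        # text after the last underscore


print(file_number("data_test/file_3.csv"))
print(file_number("data_test\\file_10.csv"))
```

One caveat: a POSIX file name that itself contains a backslash would be split at that character, so this is only a sketch for the naming scheme shown here.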

Then you can concatenate the DataFrames to get the desired output with:
df = [customRead(fn) for fn in files]
df = pd.concat(df, axis=1)

Bonus track

If you have many large files, you could consider using dask, as follows:

import dask
df = [dask.delayed(customRead)(fn) for fn in files]
df = pd.concat(dask.compute(df, scheduler='processes')[0],
               axis=1)
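
If dask is not installed, a similar fan-out can be sketched with the standard library's `concurrent.futures`. This is a swapped-in technique, not what the answer uses, and whether it is actually faster depends on the files and the machine. The helper `read_total` and the temporary sample files are hypothetical and exist only to make the sketch self-contained:

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def read_total(fn):
    # Hypothetical helper: keep only "Total" and tag it with the file name.
    name = os.path.splitext(os.path.basename(fn))[0]
    return pd.read_csv(fn, usecols=["Total"]).rename(
        columns={"Total": f"Total_{name}"})


# Made-up sample files standing in for the real data.
folder = tempfile.mkdtemp()
for i in range(3):
    pd.DataFrame({"Total": [i, i + 1]}).to_csv(
        os.path.join(folder, f"file_{i + 1}.csv"), index=False)

files = sorted(glob.glob(os.path.join(folder, "*.csv")))

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the input files.
    frames = list(pool.map(read_total, files))

combined = pd.concat(frames, axis=1)
print(combined)
```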