pandas read_csv: skip columns listed in usecols if they do not exist when creating a df

Time: 2019-09-07 21:04:50

Tags: python pandas

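I am using pandas to load thousands of CSVs. However, I am only interested in certain columns, and these may not be present in every CSV.

If a specified column name does not exist in one of the CSVs, the usecols argument does not seem to work. What is the best workaround? Thanks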

import pandas as pd

# listFilenamesPath: list of csv file paths, defined elsewhere
nrFiles = 0  # number of files processed so far
for fullPath in listFilenamesPath:
    df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
    print(nrFiles, "files converted")

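3 Answers:

Answer 0: (score: 1)

One workaround is to take the column names that appear both in the usecols list (the columns you are looking for) and in df.columns. You can then use this list of common column names to subset the df.

Code with the necessary comments: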

import pandas as pd

### the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']

nrFiles = 0
for fullPath in listFilenamesPath:
    ### read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    ### get the column names that appear in both the usecols list and df.columns
    ### (a set intersection does not preserve the order of usecols)
    final_list = list(set(usecols) & set(df.columns))
    ### subset it using the final_list
    df = df[final_list]
    ### write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
    print(nrFiles, "files converted")


Demo:

Here is the csv, loaded as a df:

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

I want to look for the following columns:

usecols = ['A', 'D', 'B']

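I read the entire CSV. I then take the columns common to the df and the list I am looking for, which in this case are A and B, and subset the df as follows: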

df = pd.read_csv('test1.csv')
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)

Output (the column order may vary, because sets are unordered):

   B  A
0  4  1
1  5  2
2  6  3
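
If the column order matters, a list comprehension over usecols keeps the order in which the columns were requested, whereas the set intersection does not. A minimal sketch, assuming the same three-column csv as in the demo; the in-memory csv_text below is only a stand-in for test1.csv so the snippet runs on its own:

```python
import io
import pandas as pd

# In-memory stand-in for test1.csv from the demo (assumption, for illustration only)
csv_text = "A,B,C\n1,4,7\n2,5,8\n3,6,9\n"
usecols = ['A', 'D', 'B']

df = pd.read_csv(io.StringIO(csv_text))
# Keep only the wanted columns that exist, in the order given in usecols
final_list = [col for col in usecols if col in df.columns]
df = df[final_list]
print(df)
```

Here final_list is ['A', 'B'], so A comes before B in the result regardless of how the set would have ordered them.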

Answer 1: (score: 0)

You can read the entire csv without using usecols. This lets you check which columns the DataFrame contains. If the DataFrame does not have the columns you need, you can ignore it or handle it however you like.
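
A rough sketch of that idea, not a definitive implementation: read the file in full, work out which wanted columns are missing, and then either keep only the ones that are present or use DataFrame.reindex to add the absent ones as NaN so every file ends up with the same columns. The sample data and the names wanted, missing, present and filled are made up for illustration.

```python
import io
import pandas as pd

wanted = ['name', 'hostname', 'application family']
# Made-up sample file that lacks the 'hostname' column (assumption)
csv_text = "name;application family;other\nsrv1;web;x\nsrv2;db;y\n"

df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Check which of the wanted columns this file does not have
missing = [col for col in wanted if col not in df.columns]
if missing:
    print("missing columns:", missing)

# Option 1: keep only the wanted columns that are present
present = df[[col for col in wanted if col in df.columns]]

# Option 2: keep all wanted columns, filling the absent ones with NaN
filled = df.reindex(columns=wanted)
```

Option 1 gives files with differing columns; option 2 keeps the layout identical across files, which may be easier to work with if the results are concatenated later.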

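Answer 2: (score: 0)

read_csv appears to raise a ValueError when it cannot find a column specified in the usecols argument. I think you could use a try/except block and skip the files that raise the error.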

for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        # skip this file, so df from a previous iteration is not reused
        continue

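Alternatively, catch the error, try to parse the offending column names out of it, and retry with the remaining subset. There may be a cleaner way to do this.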

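import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']
for fullPath in listFilenamesPath:
    usecols_ = list(usecols)
    df = None
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the error message is expected to list the missing columns in brackets
            r = re.search(r"\[(.+)\]", str(e))
            if r is None:
                raise  # some other ValueError; do not swallow it
            # strip only quotes and surrounding whitespace, so that names
            # containing spaces (e.g. 'application family') still match
            missing_cols = [c.strip().strip("'\"") for c in r.group(1).split(",")]
            remaining = [x for x in usecols_ if x not in missing_cols]
            if len(remaining) == len(usecols_):
                raise  # nothing was removed; avoid retrying forever
            usecols_ = remaining

    if df is None:
        continue  # none of the wanted columns exist in this file

    """
        rest of your code
    """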