Question

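I'm using pandas to load thousands of CSV files. However, I'm only interested in certain columns, and those columns may not be present in every CSV.

If one of the CSVs doesn't contain a specified column name, the usecols parameter doesn't seem to work. What is the best way around this? Thanks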

import pandas as pd

nrFiles = 0  # counter must be initialised before the loop
for fullPath in listFilenamesPath:
    df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
    print(nrFiles, "files converted")

Answer 1

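One workaround is to take the column names that appear both in the usecols list (the columns you are looking for) and in df.columns. You can then use this list of common column names to subset the df.

Code with the necessary comments: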

import pandas as pd

### the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']

nrFiles = 0
for fullPath in listFilenamesPath:
    ### read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    ### get the column names that appear in both the usecols list and df.columns
    ### (a list comprehension keeps the order of usecols; a set intersection would not)
    final_list = [col for col in usecols if col in df.columns]
    ### subset it using the final_list
    df = df[final_list]
    ### write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
    print(nrFiles, "files converted")

Demo:

Here is the CSV, loaded as df:

I want to look for the following columns:

usecols = ['A', 'D', 'B']

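I read the entire CSV. Then I take the columns common to the df and the list I'm looking for, which in this case are A and B, and subset the df like this: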

df = pd.read_csv('test1.csv')
# use the usecols list defined above (the original referred to an undefined name, cols)
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)

Output:
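
For a self-contained version of this demo, here is a minimal sketch that runs the same steps on an in-memory CSV. The data are made up for illustration (columns A, B and C, no D) and are not the contents of the original test1.csv; the list comprehension is used instead of a set intersection only so that the column order is predictable.

```python
import io
import pandas as pd

# Hypothetical stand-in for test1.csv: it has columns A, B and C, but no D.
csv_text = "A,B,C\n1,2,3\n4,5,6\n"

usecols = ['A', 'D', 'B']

# read everything, then keep only the wanted columns that actually exist
df = pd.read_csv(io.StringIO(csv_text))
final_list = [col for col in usecols if col in df.columns]
df = df[final_list]
print(df)
```

With this sample, only A and B remain; D is silently ignored because it is absent from the file.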

Answer 2

You can read the entire CSV without using usecols. This lets you check which columns the DataFrame contains. If the DataFrame doesn't have the columns you need, you can ignore it or handle it however you like.
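
A rough sketch of that idea, assuming you want to skip any file that lacks one of the required columns. The helper name read_wanted and the sample data are hypothetical and only there to make the example runnable; in the real loop you would pass each fullPath instead of a StringIO object.

```python
import io
import pandas as pd

wanted = ['name', 'hostname', 'application family']

def read_wanted(source, wanted, sep=";"):
    """Return a DataFrame with the wanted columns, or None if any is missing."""
    df = pd.read_csv(source, sep=sep)  # no usecols: read every column
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        return None  # caller decides how to handle an incomplete file
    return df[wanted]

# Made-up sample files: one has all the columns, one does not.
complete = io.StringIO("name;hostname;application family;extra\na;h1;web;1\n")
partial = io.StringIO("name;extra\na;1\n")

df_ok = read_wanted(complete, wanted)
df_skip = read_wanted(partial, wanted)
print(df_ok)
print(df_skip)
```

Returning None instead of raising keeps the calling loop simple: a plain `if df is None: continue` skips the file.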

Answer 3

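read_csv appears to raise a ValueError when it can't find a column specified in the usecols parameter. I think you could use a try/except block and skip the files that raise the error.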

for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        # at least one requested column is missing: skip this file
        continue

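Or catch the error, try to parse out the offending column names, and then retry with the remaining subset. There is probably a cleaner way to do this.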

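import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']
for fullPath in listFilenamesPath:
    usecols_ = usecols
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the error message lists the missing columns inside square brackets
            r = re.search(r"\[(.+)\]", str(e))
            if r is None:
                raise  # a different ValueError: re-raise rather than loop forever
            # strip only the quotes and surrounding spaces, so that names such as
            # 'application family' keep their inner space
            missing_cols = [col.strip(" '\"") for col in r.group(1).split(",")]
            usecols_ = [x for x in usecols_ if x not in missing_cols]

    """
        rest of your code
    """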

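One possibly cleaner route, which none of the answers above uses: read_csv also accepts a callable for usecols, which pandas evaluates against each column name in the file and keeps the names for which it returns True. A minimal sketch, with made-up sample data in place of a real file:

```python
import io
import pandas as pd

wanted = ['name', 'hostname', 'application family']

# Hypothetical file content: 'application family' is missing, 'extra' is unwanted.
csv_text = "name;extra;hostname\na;1;h1\nb;2;h2\n"

# The callable is evaluated per column name, so absent columns never raise.
df = pd.read_csv(io.StringIO(csv_text), sep=";", usecols=lambda col: col in wanted)
print(list(df.columns))
```

Because the filter is applied to whatever columns the file actually has, no exception handling or error-message parsing is needed. The resulting columns follow the order in the file, not the order of the wanted list.
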
pandas (Python): skip columns that don't exist when creating a df with usecols