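I am loading thousands of CSV files with pandas. However, I am only interested in certain columns, and those columns may not be present in every CSV.
The usecols parameter does not seem to work if one of the specified column names is missing from a CSV. What is the best workaround? Thanks.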
import pandas as pd

nrFiles = 0
for fullPath in listFilenamesPath:
    # fails with ValueError if any of these columns is missing from the file
    df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
Answer 0 (score: 1)
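One workaround is to take the column names that appear both in the usecols list (the list of columns you are looking for) and in df.columns. You can then use this list of common column names to subset df.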
Code with the necessary comments:
import pandas as pd

### the column names you want to look for in the dataframes
usecols = ['name', 'hostname', 'application family']
nrFiles = 0
for fullPath in listFilenamesPath:
    ### read the entire dataframe without usecols
    df = pd.read_csv(fullPath, sep=";")
    ### get the column names that appear in both the usecols list and df.columns
    final_list = list(set(usecols) & set(df.columns))
    ### subset it using the final_list
    df = df[final_list]
    ### write your df to csv and continue as usual
    df.to_csv(fullPath, sep=';', index=False, header=True, encoding='utf-8')
    nrFiles = nrFiles + 1
print(nrFiles, "files converted")
Here is an example CSV, loaded as df:
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
I want to look for the following columns:
usecols = ['A', 'D', 'B']
I read the entire CSV, then take the columns that df and the wanted list have in common (A and B in this case) and subset df as follows:
df = pd.read_csv('test1.csv')
final_list = list(set(usecols) & set(df.columns))
df = df[final_list]
print(df)
Output (the column order can vary because a set is unordered):
   B  A
0  4  1
1  5  2
2  6  3
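If the column order matters, a list comprehension can stand in for the set intersection. The following is a minimal sketch, not part of the original answer; `csv_text` is hypothetical sample data standing in for test1.csv, and `present` is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for test1.csv
csv_text = "A,B,C\n1,4,7\n2,5,8\n3,6,9\n"
usecols = ['A', 'D', 'B']

df = pd.read_csv(io.StringIO(csv_text))

# Keep only the wanted columns that exist, in the order given by usecols
present = [c for c in usecols if c in df.columns]
df = df[present]
print(df)
```

Unlike the set-based version, this keeps the columns in the order they are listed in usecols.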
Answer 1 (score: 0)
You can read the entire CSV without using usecols. That lets you check which columns the DataFrame contains. If the DataFrame lacks the columns you need, you can ignore it or handle it however you like.
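A variation on this idea, sketched under assumptions: read only the header row first (nrows=0), work out which wanted columns exist, then load just those. The sample data in `csv_text` and the names `wanted`, `available`, and `missing` are hypothetical and not from the original answer.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated file that lacks 'application family'
csv_text = "name;hostname;other\nweb;srv1;x\ndb;srv2;y\n"
wanted = ['name', 'hostname', 'application family']

# nrows=0 parses the header only, so no data rows are loaded
header = pd.read_csv(io.StringIO(csv_text), sep=";", nrows=0)

available = [c for c in wanted if c in header.columns]
missing = [c for c in wanted if c not in header.columns]

# Second pass: load only the columns known to exist
df = pd.read_csv(io.StringIO(csv_text), sep=";", usecols=available)
print("missing columns:", missing)
```

With real files, the path would be passed to both read_csv calls in place of the StringIO objects. The `missing` list makes it easy to log or skip files that lack required columns.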
Answer 2 (score: 0)
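read_csv appears to raise a ValueError when it cannot find a column named in the usecols parameter. I think you could use a try/except block and skip the files that raise the error.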
for fullPath in listFilenamesPath:
    try:
        df = pd.read_csv(fullPath, sep=";", usecols=['name', 'hostname', 'application family'])
    except ValueError:
        # skip files that lack any of the requested columns
        continue
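Alternatively, catch the error, parse the names of the missing columns out of the message, and retry with the remaining subset. There may be a cleaner way to do this.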
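import pandas as pd
import re

usecols = ['name', 'hostname', 'application family']
for fullPath in listFilenamesPath:
    usecols_ = usecols
    while usecols_:
        try:
            df = pd.read_csv(fullPath, sep=";", usecols=usecols_)
            break
        except ValueError as e:
            # the error message lists the missing columns inside brackets
            r = re.search(r"\[(.+)\]", str(e))
            if r is None:
                raise  # a different ValueError; re-raise instead of looping forever
            # strip quotes and surrounding whitespace only, so names containing spaces still match
            missing_cols = [c.strip().strip("'") for c in r.group(1).split(",")]
            usecols_ = [x for x in usecols_ if x not in missing_cols]
    """
    rest of your code
    """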
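One possibly cleaner option, which is not part of the answer above: usecols also accepts a callable that pandas evaluates against each column name, keeping the names for which it returns True. Columns that are absent are simply never matched, so no error is raised. A minimal sketch, with hypothetical sample data in `csv_text` and an illustrative name `wanted`:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated file that lacks 'application family'
csv_text = "name;hostname;other\nweb;srv1;x\ndb;srv2;y\n"
wanted = {'name', 'hostname', 'application family'}

# The callable is evaluated once per column name found in the file
df = pd.read_csv(io.StringIO(csv_text), sep=";", usecols=lambda c: c in wanted)
print(df.columns.tolist())
```

This avoids both the second read and the parsing of error messages, at the cost of silently ignoring missing columns, so a separate check is still needed if some columns are mandatory.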