Question

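I am trying to downcast columns while reading a CSV, because doing it after the file has been read is very time-consuming. So far this works well. The problem arises, of course, when a column contains NA values. Is it possible to ignore them or filter them out during reading, perhaps through the converters argument of pandas read_csv? And what is the verbose argument for? The documentation says it indicates the number of NA values placed in non-numeric columns.

My approach to downcasting so far is to read the first two rows and guess the dtype. I then build a mapping dictionary for the dtype argument used when reading the whole CSV. Of course, NaN values can appear in later rows, so columns with mixed dtypes can occur: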

import pandas as pd

# read only the first two rows to guess the dtypes
df = pd.read_csv(filePath, delimiter=delimiter, nrows=2, low_memory=True, memory_map=True, engine='c')

if downcast:
    # map the guessed dtypes to smaller ones, keyed by column position
    mapdtypes = {'int64': 'int8', 'float64': 'float32'}
    dtypes = list(df.dtypes.apply(str).replace(mapdtypes))
    dtype = {key: value for (key, value) in enumerate(dtypes)}
    df = pd.read_csv(filePath, delimiter=delimiter, memory_map=True, engine='c', low_memory=True, dtype=dtype)

Answer 1

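I'm not sure I understood your question correctly, but you may be looking for the na_values parameter, where you can specify one or more strings to be recognized as NaN values.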
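As a minimal sketch of what na_values does, the snippet below reads a small in-memory CSV. The CSV content and the markers "missing" and "-" are made up for illustration and are not from the question.

```python
import io
import pandas as pd

# Hypothetical CSV content; "missing" and "-" stand in for custom NA markers.
csv_text = "a,b\n1,x\nmissing,y\n3,-\n"

# Every string listed in na_values is parsed as NaN, in any column.
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "-"])

print(df)
print(df.isna().sum())
```

Note that column a still ends up as float64 here, because a plain NumPy integer column cannot hold NaN.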

Edit: get the dtype of each column and save it to a dictionary for downcasting. As before, you can limit the number of rows read into the df if needed.

import csv
import pandas as pd

# get only the column headers from the csv:
with open(filePath, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames

# iterate through each column to get the dtype:
dtypes = {}
for f in fieldnames:
    df = pd.read_csv(filePath, usecols=[f], nrows=1000)
    dtypes.update({f:str(df.iloc[:,0].dtypes)})
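One way the collected dtypes might then be used, sketched under assumptions: the sample data is invented, the dtypes are inferred in a single sampled read rather than column by column, and the target types in downcast_map are an arbitrary choice that only works if the full file fits into them.

```python
import io
import pandas as pd

# Hypothetical sample data with one missing value in the float column.
csv_text = "id,score\n1,0.5\n2,\n3,1.5\n"

# Infer dtypes from a limited number of rows.
sample = pd.read_csv(io.StringIO(csv_text), nrows=1000)
dtypes = {col: str(dt) for col, dt in sample.dtypes.items()}

# Map inferred dtypes to smaller ones (assumed targets); unknown dtypes are kept as-is.
downcast_map = {"int64": "int32", "float64": "float32"}
dtypes = {col: downcast_map.get(dt, dt) for col, dt in dtypes.items()}

# Read the full data with the downcast dtypes applied during parsing.
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

This still fails if an integer column has a missing value outside the sampled rows, which is the case the question describes.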

Answer 2

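The original question is related to this one, so I am answering with similar information. The pandas v1.0+ "integer array" data types do what you are asking. Use the capitalized version of the type name, e.g. "Int16". Missing values can be identified with pandas .isnull(). Here is an example. Note the capital "I" in the pandas-specific Int16 data type (Pandas Documentation).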

import pandas as pd
import numpy as np

dftemp = pd.DataFrame({'int_col':[4,np.nan,3,1],
                      'float_col':[0.0,1.0,np.nan,4.5]})

#Write to CSV (to be read back in to fully simulate CSV behavior with missing values etc.)
dftemp.to_csv('MixedTypes.csv', index=False)

lst_cols = ['int_col','float_col']
lst_dtypes = ['Int16','float']
dict_types = dict(zip(lst_cols,lst_dtypes))

#Unoptimized DataFrame    
df = pd.read_csv('MixedTypes.csv')
df

Result:

   int_col  float_col
0      4.0        0.0
1      NaN        1.0
2      3.0        NaN
3      1.0        4.5

Repeat the read with the variable types assigned, including Int16 for int_col:

df2 = pd.read_csv('MixedTypes.csv', dtype=dict_types)
print(df2)


   int_col  float_col
0        4        0.0
1     <NA>        1.0
2        3        NaN
3        1        4.5
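Combining the nullable types with the sample-based guessing from the question could look roughly like the sketch below. The helper guess_nullable_dtypes is hypothetical, and choosing Int16 and float32 assumes that all values in the full file fit into those types.

```python
import io
import pandas as pd

# Same data as the example above, kept in memory instead of on disk.
csv_text = "int_col,float_col\n4,0.0\n,1.0\n3,\n1,4.5\n"

def guess_nullable_dtypes(sample):
    """Hypothetical helper: suggest nullable Int types for whole-number columns."""
    dtypes = {}
    for col in sample.columns:
        s = sample[col]
        if pd.api.types.is_integer_dtype(s):
            dtypes[col] = "Int16"
        elif pd.api.types.is_float_dtype(s):
            values = s.dropna()
            # A float column whose non-missing values are all whole numbers
            # was probably an integer column with gaps.
            if len(values) > 0 and (values % 1 == 0).all():
                dtypes[col] = "Int16"
            else:
                dtypes[col] = "float32"
    return dtypes

sample = pd.read_csv(io.StringIO(csv_text), nrows=1000)
dtypes = guess_nullable_dtypes(sample)
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.dtypes)
```

Because Int16 accepts missing values, an NA appearing in a later row of an integer column no longer breaks the read, although a fractional value in such a column still would.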

Pandas: reading a CSV with dtypes but mixed-type columns (NA values)
