Question

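I have several columns in a CSV file with the suffix _id. Most of their values are integers, but sometimes a value is missing (NaN, i.e. empty in the CSV file). For example, some of the names might be: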

user_id
device_id
company_id
country_id
...

When I use pd.read_csv, it recognizes some of them as int64 and others as float64. I am not sure, but I think the "correct" type for these identifiers is object.

I know read_csv has a dtype parameter. So what I can do is:

dtypes = {'user_id': str,
          'device_id': str,
          'company_id': str,
          'country_id': str}
df = pd.read_csv('file.csv', dtype=dtypes)

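The drawback is that I have to specify every column. Is there a way to apply this to all "*_id" columns? And does it make a difference if I later use df['user_id'].astype(object) (when pandas has already "recognized" int64 or float64)?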

Answer 1

You can preprocess the file to extract the columns containing _id, convert them to a dictionary, and then use read_csv:

path_csv = 'file.csv'
with open(path_csv) as f:
    # only the first line (the header) is needed
    first_line = f.readline()
    # keep the column names that contain '_id'
    header = [x.strip() for x in first_line.split(',') if '_id' in x]

print (header)
['user_id', 'device_id', 'company_id', 'country_id']

d = dict.fromkeys(header, 'str')
print (d)
{'user_id': 'str', 'device_id': 'str', 'company_id': 'str', 'country_id': 'str'}

df = pd.read_csv('file.csv', dtype=d)
print (df)
  user_id device_id company_id country_id  country
0       1         2          3          4        5

print (df.dtypes)
user_id       object
device_id     object
company_id    object
country_id    object
country        int64
dtype: object
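
A variation on the same idea, offered as a minimal sketch, lets pandas parse the header instead of splitting the first line by hand, which also copes with quoted column names. It reads zero data rows with nrows=0, keeps the columns whose names end in _id, and builds the dtype mapping from them. The helper name read_csv_ids_as_str and the sample data are assumptions made for illustration; they are not part of pandas or of the original question.

```python
import os
import tempfile

import pandas as pd


def read_csv_ids_as_str(path, suffix="_id", **kwargs):
    """Hypothetical helper: read a CSV with every `*suffix` column as str."""
    # nrows=0 parses only the header row, so this first pass is cheap.
    header = pd.read_csv(path, nrows=0).columns
    id_dtypes = {col: str for col in header if col.endswith(suffix)}
    # Second pass: read the data with the string dtypes applied.
    return pd.read_csv(path, dtype=id_dtypes, **kwargs)


# Made-up sample: the second row has an empty device_id.
sample = "user_id,device_id,country\n1,2,5\n3,,6\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

df = read_csv_ids_as_str(tmp_path)
os.remove(tmp_path)

print(df)
print(df.dtypes)
```

Matching with endswith is slightly stricter than the '_id' in x test above: a column such as valid_ideas would not be picked up by accident. Empty cells still come back as missing values, not as empty strings.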

Does it make a difference if I later use df['user_id'].astype(object) (when pandas has already recognized int64 or float64)?

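That depends on the data, but if an integer column contains some NaN values, the ints are converted to float. The precision of the values can change as well, because float64 represents integers exactly only up to 2**53.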
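
To make the difference concrete, here is a small sketch with made-up data that compares converting after the fact with reading the column as strings from the start. Once pandas has inferred float64 for a column with gaps, the original text is gone, so a later conversion to strings carries the float formatting along.

```python
import io

import pandas as pd

# Made-up sample: device_id is empty in the second row.
csv_text = "user_id,device_id\n1,2\n3,\n"

inferred = pd.read_csv(io.StringIO(csv_text))
# user_id has no gaps and is read as int64;
# device_id has one empty cell and is read as float64.
print(inferred.dtypes)

# astype(object) only changes the column dtype; the values stay numeric.
as_object = inferred["device_id"].astype(object)
print(type(as_object.iloc[0]))

# Converting to str afterwards keeps the float formatting ...
late = inferred["device_id"].astype(str)
# ... while dtype=str at read time keeps the text exactly as written.
early = pd.read_csv(io.StringIO(csv_text), dtype={"device_id": str})["device_id"]

print(late.iloc[0])   # 2.0
print(early.iloc[0])  # 2
```

So for identifiers that should be treated as labels, it is safer to set the dtype when reading than to convert later.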
