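I need to scan a CSV column by column, find the strongest data type in each column, and then apply that type to the entire column.
For example, if my CSV looked like this (yes, I know there are no commas...):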
+ C1 + C2 + C3 + C4
R1 | i | s | i | f
R2 | i | f | i | i
R3 | i | i | s | f
# i = int
# f = float
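# s = str
The "strongest" type for C1 is i, for C2 it is s, for C3 it is s, and for C4 it is f.
It follows that the order of "strength" is str > float > int.
Why? Because the file type I am writing these values to explicitly requires that the data in a field (its column) match the data type declared for that field (i.e. if a field is set to FLOAT, I can't put a str in that column, otherwise the file is invalid).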
To accomplish this, I am doing the following:

1. Read the file row by row, check each column, and record the strongest type found.
2. Build a new container that holds the rows cast to those types.

Item 2 is very simple with a dictionary and a list comprehension:

    types = {header: None for header in r.fieldnames}
    # read file and store "strongest" found in 'types[header]' per column
    # ...
    typed = [[types[header](row[header]) for header in types] for row in rows]
    # note: types[header] value is a function alias (i.e. int vs int())

Item 1 is where most of the heavy lifting takes place:

    for row in r:  # r is a csv.DictReader
        rows.append(row)  # list of OrderedDicts since r is a generator
        # problematic because I have to keep checking just to append...
        if all(types[header] is str for header in types):
            continue  # all 'str' so stop checking
        for header in types:
            if types[header] is str:
                continue  # whole column can be bypassed from now on
            # function just type casts 'int' or 'float' on string by ValueError
            t = self.find_type(row[header])
            if (types[header] is int) and (t is float):
                types[header] = t  # float > int since all int's can be represented as float
            elif (types[header] is float) and (t is int):
                pass  # int < float so do nothing
            else:
                types[header] = t  # if 'str' will be caught later by first if

The worst case for this is the number of rows in the CSV, because the last row could contain a valid str type test.
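The find_type helper called in the loop is not shown in the question. The following is only a minimal sketch of the same single-pass scan, assuming find_type tries int, then float, and falls back to str. The names STRENGTH and strongest_types and the sample data are hypothetical, and columns start at int here rather than None so that two types can always be compared by rank.

```python
import csv
import io

# Weakest to strongest; a type's position in this list is its rank.
STRENGTH = [int, float, str]


def find_type(value):
    """Return the weakest of int, float, str that can parse the string."""
    for cast in (int, float):
        try:
            cast(value)
            return cast
        except ValueError:
            pass
    return str


def strongest_types(reader):
    """Scan a csv.DictReader once; return (rows, {header: strongest type})."""
    # Assumption: start every column at int, the weakest type.
    types = {header: int for header in reader.fieldnames}
    rows = []
    for row in reader:
        rows.append(row)
        for header, current in types.items():
            if current is str:
                continue  # already at the strongest type, nothing to test
            found = find_type(row[header])
            # Keep whichever of the two types ranks higher.
            types[header] = max(current, found, key=STRENGTH.index)
    return rows, types


# Hypothetical sample data mirroring the layout in the question.
sample = io.StringIO("C1,C2,C3,C4\n1,a,6,8.0\n2,4.,7,9\n3,5,b,10.0\n")
rows, types = strongest_types(csv.DictReader(sample))
typed = [[types[h](row[h]) for h in types] for row in rows]
```

Taking the max by rank replaces the if/elif chain, but the scan still has to visit every row, since any later row can promote a column to str.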
Is there a more efficient way to do this, possibly with pandas (which I don't currently use much)?
Solution:
Answer 0 (score: 2)
An example DataFrame:
In [11]: df
Out[11]:
C1 C2 C3 C4
R1 1 a 6 8.0
R2 2 4. 7 9.0
R3 3 5 b 10.0
I wouldn't try to be clever with any short-circuit evaluation. I would just take the type of each entry:
In [12]: df_types = df.applymap(type)
In [13]: df_types
Out[13]:
C1 C2 C3 C4
R1 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
R2 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
R3 <class 'int'> <class 'str'> <class 'str'> <class 'float'>
If you enumerate these types, you can use max:
In [14]: d = {ch: i for i, ch in enumerate([int, float, str])}
In [15]: d_inv = {i: ch for i, ch in enumerate([int, float, str])}
In [16]: df_types.applymap(d.get)
Out[16]:
C1 C2 C3 C4
R1 0 2 2 1
R2 0 2 2 1
R3 0 2 2 1
In [17]: df_types.applymap(d.get).max()
Out[17]:
C1 0
C2 2
C3 2
C4 1
dtype: int64
In [18]: df_types.applymap(d.get).max().apply(d_inv.get)
Out[18]:
C1 <class 'int'>
C2 <class 'str'>
C3 <class 'str'>
C4 <class 'float'>
dtype: object
Now you can iterate through each column and update it in df (to the max type):
In [21]: for col, typ in df_types.applymap(d.get).max().apply(d_inv.get).items():
   ....:     df[col] = df[col].astype(typ)
In [22]: df
Out[22]:
C1 C2 C3 C4
R1 1 a 6 8.0
R2 2 4. 7 9.0
R3 3 5 b 10.0
In [23]: df.dtypes
Out[23]:
C1 int64
C2 object
C3 object
C4 float64
dtype: object
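If you have many columns, you could make this slightly more efficient by grouping the columns by type and updating them in batches (e.g. all the string columns at once).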
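A rough sketch of that batching idea, assuming every column was read as text and the strongest type per column has already been worked out. The frame, the strongest mapping, and the name columns_by_type are hypothetical and only serve the illustration.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical frame in which every value was read as text.
df = pd.DataFrame(
    {
        "C1": ["1", "2", "3"],
        "C2": ["a", "4.", "5"],
        "C3": ["6", "7", "b"],
        "C4": ["8.0", "9", "10.0"],
    },
    index=["R1", "R2", "R3"],
)

# Assumed result of the max-by-rank step for this frame.
strongest = {"C1": int, "C2": str, "C3": str, "C4": float}

# Invert the mapping: target type -> list of columns that need it.
columns_by_type = defaultdict(list)
for col, typ in strongest.items():
    columns_by_type[typ].append(col)

# One astype call per type instead of one per column.
for typ, cols in columns_by_type.items():
    df[cols] = df[cols].astype(typ)
```

DataFrame.astype also accepts a column-to-dtype mapping, so df.astype(strongest) would be another way to cast everything in a single call.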