Question

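I need to scan a CSV by column and find the strongest data type, then apply it to the whole column.

For example, if my CSV looked like this (yes, I know there are no commas...):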

    + C1 + C2 + C3 + C4
R1  | i  | s  | i  | f
R2  | i  | f  | i  | i
R3  | i  | i  | s  | f

# i = int
# f = float
# s = str

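The "strongest" type for C1 is i, for C2 is s, for C3 is s, and for C4 is f.

It follows that the order of "strength" is str > float > int.

Why? Because the file type I'm writing these values to explicitly requires that the data in a field (its column) match the data type specified for that field (i.e. if the field is set to FLOAT, I can't put a str in that column, otherwise the file is invalid).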
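To make the ordering str > float > int concrete, here is a minimal sketch that ranks the types and takes the maximum per column of the example table. The names RANK, strongest, and columns are illustrative only and are not part of my actual code.

```python
# Rank each type by "strength"; the higher rank wins.
RANK = {int: 0, float: 1, str: 2}


def strongest(types):
    """Return the strongest type among an iterable of types."""
    return max(types, key=RANK.__getitem__)


# The example table above, column by column (i = int, f = float, s = str).
columns = {
    "C1": [int, int, int],
    "C2": [str, float, int],
    "C3": [int, int, str],
    "C4": [float, int, float],
}

result = {name: strongest(col) for name, col in columns.items()}
```

Here result maps C1 to int, C2 and C3 to str, and C4 to float.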

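To accomplish this, I'm doing the following:

1. For each file, read the file row by row and check each column; store the "strongest" type.
2. Create a new container that holds the newly type-cast rows.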

Item 2 is very simple with a dictionary and a list comprehension:

types = {header: None for header in r.fieldnames}
# read file and store "strongest" found in 'types[header]' per column
# ...
typed = [[types[header](row[header]) for header in types] for row in rows]
# note: types[header] value is a function alias (i.e. int vs int())

Item 1 is where most of the heavy lifting takes place:

for row in r:  # r is a csv.DictReader
    rows.append(row)  # list of OrderedDicts since r is a generator
    # problematic because I have to keep checking just to append...
    if all(types[header] is str for header in types):
        continue  # all 'str' so stop checking
    for header in types:
        if types[header] is str:
            continue  # whole column can be bypassed from now on
        # function just type casts 'int' or 'float' on string by ValueError
        t = self.find_type(row[header])
        if (types[header] is int) and (t is float):
            types[header] = t  # float > int since all int's can be represented as float
        elif (types[header] is float) and (t is int):
            pass  # int < float so do nothing
        else:
            types[header] = t  # if 'str' will be caught later by first if

The worst case for this is the number of rows in the CSV, since the last row might be the first one containing a value that tests as str.
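The find_type helper called in the loop is not shown. A minimal sketch of such a helper, assuming it tries int first, then float, and falls back to str when both raise ValueError (written here as a plain function rather than a method):

```python
def find_type(value):
    """Return int, float, or str: the first type that can parse `value`."""
    for caster in (int, float):
        try:
            caster(value)  # raises ValueError if the string does not parse
            return caster
        except ValueError:
            continue
    return str  # nothing numeric matched, so treat the value as text
```

Because int is tried before float, a string such as "3" is reported as int, while "4." only parses as float.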

Is there a more efficient way to do this, perhaps with pandas (which I don't use much currently)?

Solution:

pandas

Answer 1

Example DataFrame:

In [11]: df
Out[11]:
    C1  C2 C3    C4
R1   1   a  6   8.0
R2   2  4.  7   9.0
R3   3   5  b  10.0

I wouldn't try to be clever with any short-circuit evaluation. I'd just take the type of each entry:

In [12]: df_types = df.applymap(type)

In [13]: df_types
Out[13]:
               C1             C2             C3               C4
R1  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
R2  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>
R3  <class 'int'>  <class 'str'>  <class 'str'>  <class 'float'>

If you enumerate these types, you can use max:

In [14]: d = {ch: i for i, ch in enumerate([int, float, str])}

In [15]: d_inv = {i: ch for i, ch in enumerate([int, float, str])}

In [16]: df_types.applymap(d.get)
Out[16]:
    C1  C2  C3  C4
R1   0   2   2   1
R2   0   2   2   1
R3   0   2   2   1

In [17]: df_types.applymap(d.get).max()
Out[17]:
C1    0
C2    2
C3    2
C4    1
dtype: int64

In [18]: df_types.applymap(d.get).max().apply(d_inv.get)
Out[18]:
C1      <class 'int'>
C2      <class 'str'>
C3      <class 'str'>
C4    <class 'float'>
dtype: object

Now you can iterate over each column and cast it in df to its strongest ("max") type:

In [21]: for col, typ in df_types.applymap(d.get).max().apply(d_inv.get).items():
             df[col] = df[col].astype(typ)


In [22]: df
Out[22]:
    C1  C2 C3    C4
R1   1   a  6   8.0
R2   2  4.  7   9.0
R3   3   5  b  10.0

In [23]: df.dtypes
Out[23]:
C1      int64
C2     object
C3     object
C4    float64
dtype: object

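If you have many columns, you could make this slightly more efficient by grouping the columns by type and updating them in batches (e.g. all the string columns at once).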
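A rough sketch of that batching idea follows. The sample frame and the names order, rank, strongest, and cols_by_type are illustrative assumptions. It iterates over each Series instead of calling applymap, but the ranking logic is the same as above.

```python
import pandas as pd

# Sample frame matching the example; dtype=object keeps the original Python values.
df = pd.DataFrame(
    {
        "C1": [1, 2, 3],
        "C2": ["a", "4.", "5"],
        "C3": ["6", "7", "b"],
        "C4": [8.0, 9.0, 10.0],
    },
    index=["R1", "R2", "R3"],
    dtype=object,
)

order = [int, float, str]  # weakest to strongest
rank = {t: i for i, t in enumerate(order)}

# Strongest type found in each column.
strongest = {
    col: order[max(rank[type(v)] for v in df[col])] for col in df.columns
}

# Group column names by their target type ...
cols_by_type = {}
for col, typ in strongest.items():
    cols_by_type.setdefault(typ, []).append(col)

# ... then cast each group with a single astype call.
for typ, cols in cols_by_type.items():
    df = df.astype({col: typ for col in cols})
```

DataFrame.astype accepts a column-to-type mapping, so each group of columns is converted in one call rather than one call per column.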
