Question

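I am iterating over a file with 3.3 million rows to check the data type of a column and perform an operation depending on whether each value is an integer.

Cell values such as a55950602 and a92300416 are easily identified as False when issubdtype is checked against np.integer, but the check fails for ga99266e.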

Code:

import pandas as pd
import numpy as np
import time
import math

start_time = time.time()
lstNumberCounts = []
lstIllFormed = []

dfClicks = pd.read_csv('Oct3_distinct_Members.csv')
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].str.split('-').str[0]
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].apply(pd.to_numeric,errors='ignore')

for item in dfClicks['UNIV_MBR_ID']:
    if (np.issubdtype(item,np.integer)):
        lstNumberCounts.append(math.floor(math.log10(item))+1)
    else:
        lstIllFormed.append(item)


print("---Processing Time: %s seconds ---" % (time.time() - start_time))

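The code works fine for the values mentioned above, but a value like ga99266e raises the following error on the console: TypeError: data type "ga99266e" not understood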

Answer 1

The line with pd.to_numeric,errors='ignore' returns either a numeric value or the input unchanged. So for "ga99266e" it returns the string "ga99266e". If you pass a string to numpy's issubdtype, it checks whether the string is the name of a dtype (for example, np.issubdtype('int', int) returns True).
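As a minimal sketch of that behaviour (the sample strings are only illustrative), a string that names a dtype is accepted, while an arbitrary string makes NumPy raise a TypeError like the one in the question:

```python
import numpy as np

# A string that names a dtype is converted to that dtype before the check.
print(np.issubdtype("int64", np.integer))    # True
print(np.issubdtype("float64", np.integer))  # False

# A string that is not a valid dtype name cannot be converted at all.
try:
    np.issubdtype("ga99266e", np.integer)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```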

So you first need to check whether the field is still a string, and only if it is not, check whether it is a numpy integer.

Try:

import pandas as pd 
import numpy as np 
import time 
import math
start_time = time.time()
lstNumberCounts = []
lstIllFormed = []

dfClicks = pd.read_csv('Oct3_distinct_Members.csv')
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].str.split('-').str[0]
dfClicks['UNIV_MBR_ID'] = dfClicks['UNIV_MBR_ID'].apply(pd.to_numeric,errors='ignore')

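for item in dfClicks['UNIV_MBR_ID']:
    # Strings are values that pd.to_numeric could not convert.
    if not isinstance(item, str):
        # Not a string, so issubdtype will not try to parse it as a dtype name.
        if np.issubdtype(item, np.integer):
            lstNumberCounts.append(math.floor(math.log10(item)) + 1)
    else:
        lstIllFormed.append(item)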


print("---Processing Time: %s seconds ---" % (time.time() - start_time))
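With 3.3 million rows, a vectorized variant may also be worth trying. The following is only a sketch of an alternative, not the code above: it swaps errors='ignore' for errors='coerce', which turns unparseable values into NaN, and it assumes the IDs are non-negative whole numbers. The sample Series is hypothetical and stands in for the UNIV_MBR_ID column.

```python
import pandas as pd

# Hypothetical sample standing in for the UNIV_MBR_ID column.
ids = pd.Series(["55950602", "92300416", "ga99266e", "a55950602", "123"])

# errors="coerce" replaces anything non-numeric with NaN instead of
# leaving the original string in place.
numeric = pd.to_numeric(ids, errors="coerce")
is_number = numeric.notna()

# Count digits through the string form of the integer value
# (assumes non-negative whole numbers, so there is no sign or fraction).
digit_counts = numeric[is_number].astype("int64").astype(str).str.len().tolist()

# Everything that failed to parse is kept in its original string form.
ill_formed = ids[~is_number].tolist()

print(digit_counts)
print(ill_formed)
```

Because no Python-level loop is involved, no call to np.issubdtype is needed at all in this variant.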

"a123456", or any string starting with "a", works with np.issubdtype because numpy interprets the first character as a type code and the digits that follow as the item size. See:

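Array-protocol type strings (see The Array Interface)

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are

'?' boolean
'b' (signed) byte
'B' unsigned byte
'i' (signed) integer
'u' unsigned integer
'f' floating-point
'c' complex-floating point
'm' timedelta
'M' datetime
'O' (Python) objects
'S', 'a' zero-terminated bytes (not recommended)
'U' Unicode string
'V' raw data (void)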
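To illustrate the quoted rule, here is a small sketch of how np.dtype treats such type strings. The helper parse_dtype is a hypothetical name introduced only for this example.

```python
import numpy as np


def parse_dtype(spec):
    """Return np.dtype(spec), or None if NumPy rejects the string."""
    try:
        return np.dtype(spec)
    except TypeError:
        return None


int32 = parse_dtype("i4")           # kind 'i', 4 bytes per item
text10 = parse_dtype("U10")         # kind 'U', 10 characters per item
bad_size = parse_dtype("i3")        # no 3-byte integer type, expected to be rejected
bad_spec = parse_dtype("ga99266e")  # not a valid type string

print(int32, text10, bad_size, bad_spec)
```

A kind code followed by a valid size is enough to form a dtype, which is why some ID-like strings pass through np.issubdtype without an error while others raise TypeError.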

Strange numpy issubdtype behavior
