Question

I have a data frame whose columns contain both characters and numbers. Its shape is 690x12. The DataFrame looks like this:

  A1   A2     A3   A4   A5  .....
  b    12.33  c    110  +   ......
  a    3.52   q    65   -   ......  
  a    7.44   p    98   +   ......
  a    5.01   q    54   -
  b    10.87  p    33   -

My task is to label-encode every column that contains characters and then return the new data frame.

So far I have tried something like this:

dat = dataC

for column in dat:
    col = dat[column]
    temp = pd.to_numeric(col, errors = 'coerce')

    if(temp.isna().sum() == col.size):
        col1 = LabelEncoder().fit_transform(col)
        col1 = pd.DataFrame(col1).astype('int64')
        dat[column] = np.where(1, col1, dat[column])

dat.dtypes

The output looks perfect:

  A1   A2     A3   A4   A5  .....
  1    12.33  0    110  0   ......
  0    3.52   2    65   1   ......  
  0    7.44   1    98   0   ......
  0    5.01   2    54   1
  1    10.87  1    33   1

But when I print the dtypes of dat:

 object
 float64
 object
 int64
 object

I want the label-encoded columns to be int64 rather than object, but my code does not seem to achieve that. What should I do?

TIA

Answer 1

1. You can cast with astype('int64'), using the following function to check each column:

def ObjectToInt64(df):
    for i in df.columns:
        if isinstance(df.loc[df.index[0],i],int):
            df[i]=df[i].astype('int64')

ObjectToInt64(dat)
dat.info()

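Note: check the type of the elements in your object columns. If those elements are not of type int, replace int (in isinstance()) with the corresponding type. My example shows how to verify this.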
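As a rough sketch of the note above, the check can be widened so that it accepts both Python int and NumPy integer elements and looks at every value rather than only the first row. The name object_to_int64, the helper _is_integer, and the demo frame are made up for illustration and are not part of the original answer:

```python
import numpy as np
import pandas as pd


def _is_integer(value):
    # Accept Python ints and NumPy integers, but not booleans.
    return isinstance(value, (int, np.integer)) and not isinstance(value, bool)


def object_to_int64(df):
    """Cast object columns that hold only integers to int64 (in place)."""
    for name in df.columns:
        col = df[name]
        if col.dtype == object and col.map(_is_integer).all():
            df[name] = col.astype("int64")
    return df


# Hypothetical demo frame: one object column of ints, one of strings.
demo = pd.DataFrame({
    "a": pd.Series([3, 4], dtype="object"),
    "b": pd.Series(["x", "y"], dtype="object"),
})
object_to_int64(demo)
print(demo.dtypes)
```

Checking every element costs more than checking the first row, but it avoids casting a column that merely starts with an integer.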

2. Example:

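import pandas as pd

# Column 0 stores Python ints as objects; columns 1 and 2 are native integer dtypes.
s1 = pd.Series([3, 4], dtype='object')
s2 = pd.Series([5, 4], dtype='int32')
s3 = pd.Series([1, 4], dtype='int64')
df = pd.concat([s1, s2, s3], axis=1)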

Output of the types:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null object
1    2 non-null int32
2    2 non-null int64
dtypes: int32(1), int64(1), object(1)
memory usage: 120.0+ bytes

Now apply:

def ObjectToInt64(df):
    for i in df.columns:
        if isinstance(df.loc[df.index[0],i],int):
            df[i]=df[i].astype('int64')

ObjectToInt64(df)
df.info()

Output of the types:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null int64
1    2 non-null int32
2    2 non-null int64
dtypes: int32(1), int64(2)
memory usage: 120.0 bytes
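For a frame like the one in this example, pandas can also re-infer the dtype of object columns itself through DataFrame.infer_objects(). This is an alternative to the hand-written check, not what the answer uses; the sketch below assumes a small frame built the same way as the example:

```python
import pandas as pd

# Column 0 holds Python ints stored as objects, column 1 holds strings.
s1 = pd.Series([3, 4], dtype="object")
s2 = pd.Series(["a", "b"], dtype="object")
df = pd.concat([s1, s2], axis=1)

# infer_objects() returns a new frame; object columns whose values are
# all integers come back as int64.
converted = df.infer_objects()
print(converted.dtypes)
```

Because infer_objects() returns a copy, the result has to be assigned back (for example dat = dat.infer_objects()) for the change to stick.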

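3. Why do it this way? On the original df, before the conversion, the elements of the object column are plain Python int values, while the int32 column holds NumPy integers: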

type(df[0][0])

Output:

int

type(df[1][0])

Output:

numpy.int32
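The later cast can also be avoided by assigning the encoded integers straight to the column instead of passing them through np.where, which returns an array with the common dtype of its two branches (object when one branch is an object column). A minimal sketch, assuming pd.factorize(..., sort=True) as a pandas-only stand-in for LabelEncoder and a small made-up frame shaped like the one in the question:

```python
import pandas as pd

# Made-up sample resembling the first columns of the frame in the question.
dat = pd.DataFrame({
    "A1": ["b", "a", "a", "a", "b"],
    "A2": [12.33, 3.52, 7.44, 5.01, 10.87],
    "A3": ["c", "q", "p", "q", "p"],
    "A4": [110, 65, 98, 54, 33],
})

for column in dat.columns:
    col = dat[column]
    # Treat a column as text when none of its values parses as a number.
    if pd.to_numeric(col, errors="coerce").isna().all():
        # sort=True numbers the labels in sorted order: 0, 1, 2, ...
        codes, _ = pd.factorize(col, sort=True)
        dat[column] = codes.astype("int64")

print(dat.dtypes)
```

With LabelEncoder, the same idea would be dat[column] = LabelEncoder().fit_transform(col).astype('int64'), since plain column assignment keeps the dtype of the assigned array.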
