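What is the fastest way to convert categorical data with many features to numeric in Python?

Time: 2018-04-04 16:35:30

Tags: python pandas

For example, I have a mushroom dataset that contains dozens of categorical features. I want to load it into a pandas.DataFrame and convert it to numeric. The features are stored in columns and the rows represent different samples, so the numeric conversion should be applied column by column. In R, I only need two lines of code: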

#Load the data. The features are categorical.
mushrooms <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header = FALSE, stringsAsFactors = TRUE)

#Convert the features to numeric. The features are stored in columns.
mushroomsNumeric <- data.frame(lapply(mushrooms, as.numeric))

# View the first 5 samples of the original data.
mushrooms[1:5,]
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p   k   s   u
2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p   n   n   g
3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p   n   n   m
4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p   k   s   u
5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e   n   a   g

# View the first 5 samples of the converted data.  
mushroomsNumeric[1:5,]
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1  2  6  3  5  2  7  2  1  2   5   1   4   3   3   8   8   1   3   2   5   3   4   6
2  1  6  3 10  2  1  2  1  1   5   1   3   3   3   8   8   1   3   2   5   4   3   2
3  1  1  3  9  2  4  2  1  1   6   1   3   3   3   8   8   1   3   2   5   4   3   4
4  2  6  4  9  2  7  2  1  2   6   1   4   3   3   8   8   1   3   2   5   3   4   6
5  1  6  3  4  1  6  2  2  1   5   2   4   3   3   8   8   1   3   2   1   4   1   2

What is the fastest way to do the same thing in Python with a pandas.DataFrame? Thanks!

3 answers:

Answer 0 (score: 3)

You can also use LabelEncoder from the sklearn library

import pandas as pd
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()

# sample data
df = pd.DataFrame({'V1': ['a', 'b', 'a', 'd'],
                   'V2': ['c', 'd', 'd', 'c']})

# apply fit_transform to each column independently
df.apply(lbl.fit_transform)

   V1   V2
0   0   0
1   1   1
2   0   1
3   2   0
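
For a pandas-only variant of the same idea, here is a minimal sketch. It assumes the goal is to mirror what R's as.numeric does to a factor column: levels sorted alphabetically and codes starting at 1. The names `codes` and `r_style` are illustrative only.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({"V1": ["a", "b", "a", "d"],
                   "V2": ["c", "d", "d", "c"]})

# Encode each column on its own: categories are sorted and codes start at 0,
# which gives the same integers as the LabelEncoder output above.
codes = df.apply(lambda col: col.astype("category").cat.codes)

# R factor codes are 1-based, so add 1 to get R-style numbers.
r_style = codes + 1

print(codes)
print(r_style)
```

The 0-based codes are usually what Python libraries expect; the `+ 1` step only matters if the numbers have to match the R output exactly.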

Answer 1 (score: 2)

Use pd.factorize

def f(x):
    return pd.factorize(x)[0]

To factorize each column

df.apply(f)

To factorize each row

df.apply(f, 1)

To factorize the entire DataFrame as a whole

pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)
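
One thing worth knowing about pd.factorize is that, by default, it numbers values in order of first appearance rather than alphabetically, and that it also returns the unique values needed to decode the result. The sketch below illustrates both points on a made-up Series; the variable names are hypothetical.

```python
import pandas as pd

s = pd.Series(["p", "e", "e", "p", "a"])

# Default: codes follow the order in which values first appear.
codes_appearance, uniques_appearance = pd.factorize(s)

# sort=True: codes follow the sorted order of the unique values,
# which matches LabelEncoder and R factor levels (apart from R being 1-based).
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)

# The uniques act as a lookup table to recover the original labels.
decoded = uniques_sorted.take(codes_sorted)

print(codes_appearance, list(uniques_appearance))
print(codes_sorted, list(uniques_sorted))
print(list(decoded))
```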

Answer 2 (score: 0)

Here is a summary of the two different solutions from the previous answers, and how they behave in my case.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the data with categorical features.
mushrooms = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header=None)

# Convert the categorical features to numeric: solution 1.
labelEncoder = LabelEncoder()
mushroomsNumeric = mushrooms.apply(labelEncoder.fit_transform)

# Convert the categorical features to numeric: solution 2.
mushroomsNumeric2 = pd.DataFrame(
    pd.factorize(mushrooms.values.ravel())[0].reshape(mushrooms.shape),
    mushrooms.index, mushrooms.columns)

mushroomsNumeric.head(5)
Out[35]: 
   0   1   2   3   4   5   6   7   8   9  ...  13  14  15  16  17  18  19  20  \
0   1   5   2   4   1   6   1   0   1   4 ...   2   7   7   0   2   1   4   2   
1   0   5   2   9   1   0   1   0   0   4 ...   2   7   7   0   2   1   4   3   
2   0   0   2   8   1   3   1   0   0   5 ...   2   7   7   0   2   1   4   3   
3   1   5   3   8   1   6   1   0   1   5 ...   2   7   7   0   2   1   4   2   
4   0   5   2   3   0   5   1   1   0   4 ...   2   7   7   0   2   1   0   3   

   21  22  
0   3   5  
1   2   1  
2   2   3  
3   3   5  
4   0   1  

[5 rows x 23 columns]

mushroomsNumeric2.head(5)
Out[36]: 
   0   1   2   3   4   5   6   7   8   9  ...  13  14  15  16  17  18  19  20  \
0   0   1   2   3   4   0   5   6   3   7 ...   2   9   9   0   9  10   0   7   
1   8   1   2  12   4  13   5   6  14   7 ...   2   9   9   0   9  10   0   3   
2   8  14   2   9   4  16   5   6  14   3 ...   2   9   9   0   9  10   0   3   
3   0   1  12   9   4   0   5   6   3   3 ...   2   9   9   0   9  10   0   7   
4   8   1   2  15   5   3   5   9  14   7 ...   2   9   9   0   9  10   8   3   

   21  22  
0   2  11  
1   3  15  
2   3  17  
3   2  11  
4  13  15  

[5 rows x 23 columns]
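
The two results differ because solution 1 encodes every column independently (sorted labels, codes restarting at 0 in each column), while solution 2 builds a single code table shared by all columns, numbered by first appearance when the frame is read row by row. A small sketch on made-up data, not the mushroom file, shows the difference; `toy`, `per_column` and `whole_frame` are hypothetical names, and the per-column step uses pandas category codes as a stand-in for LabelEncoder.

```python
import pandas as pd

# Tiny stand-in for the mushroom data: the label 'p' occurs in both columns.
toy = pd.DataFrame({"V1": ["p", "e", "e"],
                    "V2": ["x", "p", "x"]})

# Per-column encoding: each column gets its own sorted code table.
per_column = toy.apply(lambda col: col.astype("category").cat.codes)

# Whole-frame encoding: one code table shared across all columns,
# numbered by first appearance in row-major order.
whole_frame = pd.DataFrame(
    pd.factorize(toy.to_numpy().ravel())[0].reshape(toy.shape),
    index=toy.index, columns=toy.columns)

print(per_column)
print(whole_frame)
```

With the per-column encoding, 'p' is 1 in V1 but 0 in V2; with the whole-frame encoding, 'p' is 0 wherever it appears. Only the per-column form corresponds to the R output in the question.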