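For example, I have a mushroom dataset that contains dozens of categorical features. I want to load it into a pandas.DataFrame and convert it to numeric. The features are stored in columns and each row is a different sample, so the numeric conversion should be applied column by column. In R, I only need two lines of code: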
#Load the data. The features are categorical.
mushrooms <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header = FALSE, stringsAsFactors = TRUE)
#Convert the features to numeric. The features are stored in columns.
mushroomsNumeric <- data.frame(lapply(mushrooms, as.numeric))
# View the first 5 samples of the original data.
mushrooms[1:5,]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1 p x s n t p f c n k e e s s w w p w o p k s u
2 e x s y t a f c b k e c s s w w p w o p n n g
3 e b s w t l f c b n e c s s w w p w o p n n m
4 p x y w t p f c n n e e s s w w p w o p k s u
5 e x s g f n f w b k t e s s w w p w o e n a g
# View the first 5 samples of the converted data.
mushroomsNumeric[1:5,]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1 2 6 3 5 2 7 2 1 2 5 1 4 3 3 8 8 1 3 2 5 3 4 6
2 1 6 3 10 2 1 2 1 1 5 1 3 3 3 8 8 1 3 2 5 4 3 2
3 1 1 3 9 2 4 2 1 1 6 1 3 3 3 8 8 1 3 2 5 4 3 4
4 2 6 4 9 2 7 2 1 2 6 1 4 3 3 8 8 1 3 2 5 3 4 6
5 1 6 3 4 1 6 2 2 1 5 2 4 3 3 8 8 1 3 2 1 4 1 2
What is the fastest way to do the same thing in Python with a pandas.DataFrame? Thanks!
Answer 0 (score: 3)
You can also use LabelEncoder from the sklearn library.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
# sample data
df = pd.DataFrame({'V1': ['a', 'b', 'a', 'd'],
                   'V2': ['c', 'd', 'd', 'c']})
# apply the encoder to each column separately (it is refit per column)
df.apply(lbl.fit_transform)
V1 V2
0 0 0
1 1 1
2 0 1
3 2 0
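For comparison, here is a minimal pandas-only sketch of the same per-column encoding, assuming the sample df from this answer. It converts each column to the categorical dtype and reads the integer codes; the names codes and codes_one_based are illustrative only. Because categories are sorted, the codes come out 0-based in alphabetical order, and adding 1 yields 1-based numbering like as.numeric() on an R factor.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({'V1': ['a', 'b', 'a', 'd'],
                   'V2': ['c', 'd', 'd', 'c']})

# Convert each column to the categorical dtype and take its integer codes.
# Categories are sorted, so codes are 0-based in alphabetical order.
codes = df.apply(lambda col: col.astype('category').cat.codes)

# Shift by one to get 1-based codes, as R's as.numeric() gives for factors.
codes_one_based = codes + 1

print(codes)
print(codes_one_based)
```

This avoids the sklearn dependency when only the integer codes are needed; whether it is faster on a given dataset would have to be measured.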
Answer 1 (score: 2)
Use pd.factorize
def f(x):
    # factorize returns (codes, uniques); keep only the integer codes
    return pd.factorize(x)[0]
To factorize each column
df.apply(f)
To factorize each row
df.apply(f, 1)
To factorize the whole dataframe at once, with codes shared across columns
pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)
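The difference between the per-column and whole-dataframe variants can be seen on a small example. The sketch below assumes the same four-row sample frame used in the previous answer; per_column and whole are illustrative names, and the dict comprehension is meant as an equivalent of df.apply(f). Note that pd.factorize numbers values in order of first appearance, not in sorted order.

```python
import pandas as pd

# Sample data (assumed here; same as in the LabelEncoder answer).
df = pd.DataFrame({'V1': ['a', 'b', 'a', 'd'],
                   'V2': ['c', 'd', 'd', 'c']})

# Per-column factorization: every column gets its own code space.
per_column = pd.DataFrame(
    {name: pd.factorize(df[name])[0] for name in df.columns},
    index=df.index)

# Whole-frame factorization: values are flattened row by row, so a given
# value receives the same code no matter which column it appears in.
whole = pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    index=df.index, columns=df.columns)

print(per_column)  # 'd' is 2 in V1 but 1 in V2
print(whole)       # 'd' is 3 in both columns
```

Which variant is appropriate depends on whether the same letter is supposed to mean the same thing in different columns.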
Answer 2 (score: 0)
Here is a summary of two different solutions based on the previous answers, and the results they give in my case.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load the data with categorical features.
mushrooms = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header = None)
# Convert the categorical features to numeric: solution 1.
labelEncoder = LabelEncoder()
mushroomsNumeric = mushrooms.apply(labelEncoder.fit_transform)
# Convert the categorical features to numeric: solution 2.
mushroomsNumeric2 = pd.DataFrame(
    pd.factorize(mushrooms.values.ravel())[0].reshape(mushrooms.shape),
    mushrooms.index, mushrooms.columns)
mushroomsNumeric.head(5)
Out[35]:
0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 17 18 19 20 \
0 1 5 2 4 1 6 1 0 1 4 ... 2 7 7 0 2 1 4 2
1 0 5 2 9 1 0 1 0 0 4 ... 2 7 7 0 2 1 4 3
2 0 0 2 8 1 3 1 0 0 5 ... 2 7 7 0 2 1 4 3
3 1 5 3 8 1 6 1 0 1 5 ... 2 7 7 0 2 1 4 2
4 0 5 2 3 0 5 1 1 0 4 ... 2 7 7 0 2 1 0 3
21 22
0 3 5
1 2 1
2 2 3
3 3 5
4 0 1
[5 rows x 23 columns]
mushroomsNumeric2.head(5)
Out[36]:
0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 17 18 19 20 \
0 0 1 2 3 4 0 5 6 3 7 ... 2 9 9 0 9 10 0 7
1 8 1 2 12 4 13 5 6 14 7 ... 2 9 9 0 9 10 0 3
2 8 14 2 9 4 16 5 6 14 3 ... 2 9 9 0 9 10 0 3
3 0 1 12 9 4 0 5 6 3 3 ... 2 9 9 0 9 10 0 7
4 8 1 2 15 5 3 5 9 14 7 ... 2 9 9 0 9 10 8 3
21 22
0 2 11
1 3 15
2 3 17
3 2 11
4 13 15
[5 rows x 23 columns]