How can I automatically convert a large amount of categorical data from strings to numeric values?

Date: 2019-01-11 08:00:53

Tags: python machine-learning decision-tree

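I am trying to build a decision tree regression model to predict the MSRP (manufacturer's suggested retail price) of cars. However, I am having trouble converting the categorical values to numeric ones.

My question: I have 8 categorical feature columns; some have as many as 40 distinct values, and there are 20,000 instances. Which method should I use to convert the categorical data for decision tree regression? And is there a way to pick up the unique values automatically instead of entering them by hand?

I tried converting the categorical values with LabelEncoder, but for some reason the first column of the array in df.values (BMW, Acura, ...) is unchanged even after the conversion.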

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
df = pd.read_excel(r'C:\Users\user\Desktop\data.xlsx')
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df.values[:, 0] = labelencoder.fit_transform(df.values[:, 0])

This is the result I get:

array([['BMW', '1 Series M', 2011, ..., 19, 3916, 46135],
       ['BMW', '1 Series', 2011, ..., 19, 3916, 40650],
       ['BMW', '1 Series', 2011, ..., 20, 3916, 36350],
       ...,
       ['Acura', 'ZDX', 2012, ..., 16, 204, 50620],
       ['Acura', 'ZDX', 2013, ..., 16, 204, 50920],
       ['Lincoln', 'Zephyr', 2006, ..., 17, 61, 28995]], dtype=object)

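I want the first column to hold numeric values for DT regression. Can anyone help? I am doing this for my final-year project (FYP), and this is my first time working with machine learning.

2 Answers:

Answer 0: (score: 2)

There are several ways to convert categorical data to numbers using pandas and sklearn: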

  
      
1. pandas.get_dummies() (one-hot encoding)
   Example:

import numpy as np
import pandas as pd

df = pd.DataFrame([['BMW', '1 Series M', 2011, 19, 3916, 46135],
       ['BMW', '1 Series', 2011,19, 3916, 40650],
       ['BMW', '1 Series', 2011,20, 3916, 36350],
       ['Acura', 'ZDX', 2012, 16, 204, 50620],
       ['Acura', 'ZDX', 2013, 16, 204, 50920],
       ['Lincoln', 'Zephyr', 2006, 17, 61, 28995]]) #Sample dataframe

pd.get_dummies(df, columns = [0,1,2]) #Dummies of 1st,2nd and 3rd column
  

Output: [image]
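The columns argument above lists the columns by hand. As a rough sketch of how that selection could be automated, the snippet below picks every non-numeric column with select_dtypes and passes the result to get_dummies. The column names (Make, Model, Year, MSRP) and the reduced sample rows are assumptions made for illustration; they do not come from the original spreadsheet.

```python
import pandas as pd

# Small sample frame; the column names are hypothetical.
df = pd.DataFrame(
    [["BMW", "1 Series M", 2011, 46135],
     ["BMW", "1 Series", 2011, 40650],
     ["Acura", "ZDX", 2012, 50620],
     ["Lincoln", "Zephyr", 2006, 28995]],
    columns=["Make", "Model", "Year", "MSRP"],
)

# Collect every non-numeric column instead of typing the names by hand.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# One-hot encode only those columns; numeric columns are left untouched.
encoded = pd.get_dummies(df, columns=categorical_cols)

print(categorical_cols)
print(encoded.columns.tolist())
```

With 8 categorical columns of up to 40 distinct values each, one-hot encoding can add a few hundred columns, which is worth keeping in mind when choosing between this and an integer encoding.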

     

2. LabelEncoder
   Example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['BMW', '1 Series M', 2011, 19, 3916, 46135],
       ['BMW', '1 Series', 2011,19, 3916, 40650],
       ['BMW', '1 Series', 2011,20, 3916, 36350],
       ['Acura', 'ZDX', 2012, 16, 204, 50620],
       ['Acura', 'ZDX', 2013, 16, 204, 50920],
       ['Lincoln', 'Zephyr', 2006, 17, 61, 28995]]) #Sample dataframe

df[[0,1,2]].apply(LabelEncoder().fit_transform)
  

Output (this returns only the transformed columns, which still need to be combined with the original DataFrame): [image]

df[[0, 1, 2]] = df[[0, 1, 2]].apply(LabelEncoder().fit_transform)
# writes the encoded columns back into the DataFrame
  

Output: [image]
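For the decision tree regression the question asks about, the encoding and the model can also be combined in one pipeline. The following is only a sketch under stated assumptions: it swaps in sklearn's OrdinalEncoder (the feature-oriented counterpart of LabelEncoder) rather than LabelEncoder itself, and the column names and sample rows are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Small sample frame; the column names are hypothetical.
df = pd.DataFrame(
    [["BMW", "1 Series M", 2011, 46135],
     ["BMW", "1 Series", 2011, 40650],
     ["Acura", "ZDX", 2012, 50620],
     ["Lincoln", "Zephyr", 2006, 28995]],
    columns=["Make", "Model", "Year", "MSRP"],
)

X = df.drop(columns="MSRP")
y = df["MSRP"]

# Detect the categorical (non-numeric) feature columns automatically.
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Integer-encode the categorical columns; pass numeric columns through as-is.
# Categories not seen during fit are mapped to -1 at prediction time.
preprocess = ColumnTransformer(
    [("categories",
      OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
      categorical_cols)],
    remainder="passthrough",
)

model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the encoder inside the pipeline means the same category-to-integer mapping learned during fit is reused at prediction time.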

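Answer 1: (score: 0)

Actually, you are assigning the data the wrong way: df.values[:, 0] = ... does not write back to the DataFrame. Assign to the DataFrame column itself instead, for example df[df.columns[0]] = ... (the df[:, 0] form is NumPy-style indexing and is not valid on a DataFrame).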
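A minimal sketch of why the assignment in the question has no effect, and of one way to make it stick. It assumes a frame with mixed dtypes like the one in the question, for which df.values has to build a new object array; the column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small mixed-dtype sample frame; the column names are hypothetical.
df = pd.DataFrame(
    [["BMW", 2011, 46135],
     ["Acura", 2012, 50620],
     ["Lincoln", 2006, 28995]],
    columns=["Make", "Year", "MSRP"],
)

encoded = LabelEncoder().fit_transform(df["Make"])

# With mixed dtypes, df.values builds a fresh object array, so this write
# lands in that temporary array and the DataFrame is left unchanged.
snapshot = df.values
snapshot[:, 0] = encoded
before = df["Make"].tolist()
print(before)  # ['BMW', 'Acura', 'Lincoln']

# Assigning to the column itself replaces it inside the DataFrame.
df["Make"] = encoded
after = df["Make"].tolist()
print(after)  # [1, 0, 2]
```

LabelEncoder numbers the classes in sorted order, which is why Acura becomes 0, BMW 1, and Lincoln 2.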
