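Loading the SKLearn cancer dataset into a Pandas DataFrame

Time: 2017-06-03 04:58:10

Tags: python numpy scikit-learn

I am trying to load a sklearn.dataset, and I am missing a column according to the keys (target_names, target & DESCR). I have tried various methods to include the last column, but each one produced errors.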

 import numpy as np
 import pandas as pd
 from sklearn.datasets import load_breast_cancer

 cancer = load_breast_cancer()
 print(cancer.keys())
  

The keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']

 data = pd.DataFrame(cancer.data, columns=[cancer.feature_names])
 print(data.describe())

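With the code above, I get only 30 columns when I need 31. What is the best way to load scikit-learn datasets into a pandas DataFrame?

6 Answers:

Answer 0 (score: 8)

Another option for creating a DataFrame that contains both the features and the target variable is: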

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns= np.append(cancer['feature_names'], ['target']))
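
One detail worth knowing about this approach: `np.c_` stacks everything into a single floating-point array, so the `target` column comes out as float rather than integer. A minimal sketch, assuming you want integer class labels back (the cast at the end is an addition for illustration, not part of the answer above):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Stack the 30 feature columns and the target into one 2-D array.
# The result is float64 because the feature matrix is float64.
df = pd.DataFrame(
    np.c_[cancer["data"], cancer["target"]],
    columns=np.append(cancer["feature_names"], ["target"]),
)

# Optional: cast the class labels back to integers.
df["target"] = df["target"].astype(int)

print(df.shape)  # (569, 31)
print(df["target"].dtype)
```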

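Answer 1 (score: 2)

If you want the target column, you need to add it yourself, because it is not in cancer.data. cancer.target holds the column of 0s and 1s, and cancer.target_names holds the corresponding labels. I hope the following is what you want: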

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.keys())

# Pass feature_names directly; wrapping it in a list creates MultiIndex columns.
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())

data = data.assign(target=pd.Series(cancer.target))
print(data.describe())

# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape)  # data.describe() won't show the "target" column here because its values were converted to strings.

Answer 2 (score: 2)

This also works, again using pd.Series.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.keys())

# Pass feature_names directly; wrapping it in a list creates MultiIndex columns.
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)

print(data.keys())
print(data.shape)

Answer 3 (score: 1)

Only the target column is missing, so you can simply add it.

df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

Answer 4 (score: 0)

Mapping the target names can be handled elegantly with map():

data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
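
The one-liner above can be sketched as a self-contained script, assuming `data` is the feature DataFrame built as in the earlier answers. The `pd.Categorical.from_codes` variant at the end is not from the original answer; it is shown only as a possible alternative that builds the labelled column straight from the integer codes.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Map each numeric class (0 or 1) to its name in cancer.target_names.
data["target"] = pd.Categorical(
    pd.Series(cancer.target).map(lambda x: cancer.target_names[x])
)

# Alternative (an assumption about what you might prefer, not the answer's method):
# build the categorical directly from the integer codes.
labels = pd.Categorical.from_codes(cancer.target, categories=cancer.target_names)

print(data["target"].value_counts())
```

Because the column is categorical, it stores the label names once and keeps compact integer codes underneath, which can be convenient for grouping and plotting.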

Answer 5 (score: 0)

As of scikit-learn 0.23, you can do the following to get a DataFrame that includes the target column.

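from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer(as_frame=True)  # returns a Bunch, not a DataFrame
df = cancer.frame  # DataFrame with the 30 features plus the target column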
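
If you would rather keep the features and the target separate, the same `as_frame=True` option can be combined with `return_X_y=True`. A minimal sketch, assuming scikit-learn 0.23 or later with pandas installed (the names `X`, `y`, and `combined` are illustrative only):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# With both flags set, the loader returns a (DataFrame, Series) pair.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Rejoin them into one 31-column DataFrame if needed.
combined = pd.concat([X, y.rename("target")], axis=1)

print(X.shape)         # (569, 30)
print(combined.shape)  # (569, 31)
```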