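I'm trying to load a dataset from sklearn.datasets, and judging by its keys (target_names, target and DESCR) I'm missing a column. I have tried various ways to include the last column, but each one ended in an error.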
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
The keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
With the code above I get only 30 columns, but I need 31. What is the best way to load a scikit-learn dataset into a pandas DataFrame?
Answer 0 (score: 8)
Another option for creating a DataFrame that contains both the features and the target variable is:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
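One detail worth checking with this approach: np.c_ stacks everything into a single float array, so the target column comes back as floats rather than integers. The sketch below repeats the answer's code and then casts the column back; the cast and the variable name target_dtype_before are my own additions, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Stack the 30 feature columns and the target into one array, as in the answer.
df = pd.DataFrame(
    np.c_[cancer["data"], cancer["target"]],
    columns=np.append(cancer["feature_names"], ["target"]),
)

# The combined array has a single dtype, so the integer target became float.
target_dtype_before = df["target"].dtype

# Optional step (an addition for illustration): restore an integer target.
df["target"] = df["target"].astype(int)

print(df.shape)
print(target_dtype_before, "->", df["target"].dtype)
```

If you only need the column to exist and don't care about its dtype, the cast can be skipped.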
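Answer 1 (score: 2)
If you want the target column, you need to add it yourself, because it is not part of cancer.data. cancer.target holds the values 0 or 1, and cancer.target_names holds the corresponding labels. I hope the following is what you want: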
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
data = data.assign(target=pd.Series(cancer.target))
print(data.describe())
# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape)  # data.describe() won't show the "target" column here because its values were converted to strings.
Answer 2 (score: 2)
This works too, and it also uses pd.Series.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)
print(data.keys())
print(data.shape)
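Passing index=data.index matters whenever the DataFrame does not carry the default 0..n-1 index, because pandas aligns a Series to the frame by index label when assigning it. A minimal sketch of the difference, assuming a hypothetical frame whose index was shifted by 1000 (the shift and the names shifted, misaligned and aligned are illustrative only):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Hypothetical scenario: the frame's index is no longer 0..n-1.
shifted = data.copy()
shifted.index = shifted.index + 1000

# A Series with the default index shares no labels with the frame,
# so label alignment fills the new column with NaN.
misaligned = shifted.assign(target=pd.Series(cancer.target))

# Reusing the frame's own index keeps every row paired with its target.
aligned = shifted.assign(target=pd.Series(cancer.target, index=shifted.index))

n_missing_misaligned = int(misaligned["target"].isna().sum())
n_missing_aligned = int(aligned["target"].isna().sum())
print(n_missing_misaligned, n_missing_aligned)
```

With the default index, as in the question, both forms give the same result; assigning the raw array (data['Target'] = cancer.target) also sidesteps alignment, since arrays are assigned by position.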
Answer 3 (score: 1)
Only the target column is missing, so you can simply add it.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
Answer 4 (score: 0)
Map the target codes to their names:
data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
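For context, here is a self-contained variant of the same idea. Instead of the lambda it uses pd.Categorical.from_codes, which treats the integer targets as positions into the category list; this is a swapped-in alternative, not the answerer's code, and label_counts is just an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# The targets 0/1 are positions into cancer.target_names, so they can be
# used directly as category codes.
data["target"] = pd.Categorical.from_codes(
    cancer.target, categories=cancer.target_names
)

# Count how many rows carry each label.
label_counts = data["target"].value_counts()
print(label_counts)
```

A categorical column keeps the labels readable while still storing compact integer codes underneath, which you can get back with data["target"].cat.codes.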
Answer 5 (score: 0)
As of scikit-learn 0.23, you can do the following to get a DataFrame that already includes the target column.
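cancer = load_breast_cancer(as_frame=True)  # returns a Bunch, not a DataFrame
df = cancer.frame  # DataFrame with the 30 feature columns plus 'target'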
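A runnable version of this, including the import, might look like the sketch below. It also shows the return_X_y=True variant for when you want features and target as separate pandas objects; the variable names are my own choice.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True (scikit-learn >= 0.23) makes the loader return pandas objects.
cancer = load_breast_cancer(as_frame=True)

# .frame holds the features and the target together in one DataFrame.
df = cancer.frame
print(df.shape)

# Alternatively, get the features and the target as separate objects.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
print(X.shape, y.shape)
```

Here the target stays an integer column, so no cast or manual join is needed.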