Question

I am looking at the famous Titanic dataset from the Kaggle competition found here: http://www.kaggle.com/c/titanic-gettingStarted/data

I load and process the data as follows:

# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# load the data from the file
df = pd.read_csv('./data/train.csv')

# import the scatter_matrix functionality
# (pandas.tools.plotting was removed in later pandas releases; it now lives in pandas.plotting)
from pandas.plotting import scatter_matrix

# define colors list, to be used to plot survived either red (=0) or green (=1)
colors=['red','green']

# make a scatter plot
scatter_matrix(df,figsize=[20,20],marker='x',c=df.Survived.apply(lambda x:colors[x]))

df.info()

[Image: scatter_matrix output]

How can I add categorical columns such as Sex and Embarked to the plot?

Answer 1

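You need to convert the categorical variables to numbers before you can plot them.

Example (assuming the 'Sex' column holds the gender data, with 'M' for male and 'F' for female):

import numpy as np

df['Sex_int'] = np.nan
df.loc[df['Sex'] == 'M', 'Sex_int'] = 0
df.loc[df['Sex'] == 'F', 'Sex_int'] = 1

All males are now represented by 0 and females by 1. Unknown genders (if any) stay NaN and will be ignored.

The rest of your code should work fine with the updated DataFrame.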
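As a minimal sketch of that recoding, the snippet below runs it on a tiny made-up frame rather than the real train.csv. It assumes the labels are 'male' and 'female' (the index used in Answer 2 suggests the Kaggle file stores them that way); the sample rows and the `Age` values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for train.csv (values are illustrative only)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", None],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Start with NaN everywhere, then fill in codes for the known labels
df["Sex_int"] = np.nan
df.loc[df["Sex"] == "male", "Sex_int"] = 0
df.loc[df["Sex"] == "female", "Sex_int"] = 1

# Rows with a missing or unexpected label keep NaN in Sex_int
print(df)
```

Because `Sex_int` is numeric, scatter_matrix will pick it up along with the other numeric columns; the row with the missing label simply has no point in the `Sex_int` panels.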

Answer 2

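After some googling and remembering things like the .map() function, I fixed it the following way:

import pandas as pd

colors = ['red', 'green']  # color codes for survived: 0=red, 1=green

# create mapping Series for gender so it can be plotted
gender = pd.Series([0, 1], index=['male', 'female'])
df['gender'] = df.Sex.map(gender)

# create mapping Series for Embarked so it can be plotted
# (NaN is skipped here, so missing ports stay NaN after mapping)
ports = df.Embarked.dropna().unique()
embarked = pd.Series(range(len(ports)), index=ports)
df['embarked'] = df.Embarked.map(embarked)

# add survived also back to the df
# (target is not defined in this snippet; presumably the Survived column set aside earlier)
df['survived'] = target

Now I can plot it again... and drop the added columns afterwards.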
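The plot-then-clean-up step is not shown in the answer; one way it might look is sketched below. The sample rows are made up, the `Agg` backend is only there so the sketch runs without a display, and plain dicts are used for the mappings instead of a mapping Series (`Series.map` accepts both).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up rows standing in for train.csv (illustrative values only)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "Q", "C", "S"],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 8.05, 11.13, 51.86],
})

# Temporary numeric columns for the two categorical variables
df["gender"] = df["Sex"].map({"male": 0, "female": 1})
ports = df["Embarked"].dropna().unique()
df["embarked"] = df["Embarked"].map(dict(zip(ports, range(len(ports)))))

# One color per row: red for Survived == 0, green for Survived == 1
colors = ["red", "green"]
point_colors = [colors[s] for s in df["Survived"]]

# scatter_matrix only plots numeric columns, so Sex and Embarked are skipped
axes = scatter_matrix(df, figsize=(10, 10), marker="x", c=point_colors)

# Drop the helper columns once the plot has been drawn
df = df.drop(columns=["gender", "embarked"])
```

Note that this sample has no missing values. On the real data, columns with NaN (Age, for instance) may cause a length mismatch between the plotted points and the color list, so filtering or filling those rows first is probably needed.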

Thanks everyone for your responses...

Answer 3

Here is my solution:

# convert string column to category
df.Sex = df.Sex.astype('category')
# create additional column for its codes
df['Sex_code'] = df.Sex.cat.codes
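The same idea extends to Embarked. The sketch below applies it to both columns of a small made-up frame; the `_code` suffix and the sample rows are assumptions for illustration. Two details are worth knowing: categories inferred from strings are sorted, so the codes follow alphabetical order, and missing values get the code -1 rather than NaN.

```python
import pandas as pd

# Made-up sample; one Embarked value is missing on purpose
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

for col in ["Sex", "Embarked"]:
    # Convert the string column to a categorical and store its integer codes
    df[col] = df[col].astype("category")
    df[col + "_code"] = df[col].cat.codes

# Position in the categories list is the code, which gives a lookup for axis labels
embarked_lookup = dict(enumerate(df["Embarked"].cat.categories))
print(df)
print(embarked_lookup)
```

If the -1 for missing values would distort the plot, one option is to replace it with NaN before calling scatter_matrix, for example `df['Embarked_code'].replace(-1, float('nan'))`.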

Pandas scatter_matrix - plot categorical variables
