How do I create a new column and assign values to it based on a condition?

Time: 2014-11-10 22:28:38

Tags: python pandas

My DataFrame looks like this:

         lname    fname   rno_cd    eri_cd
    0    CRUISE   TOM     E         1
    1    DEPP     JOHNNY  Y         0
    2    DICAPR   LENARDO           1
    3    PITT     BRAD              1
    4    MOST     JEFF    A         0
    5    HANKS    TOM               1
    6    BRANDO   MARLON  C         1
    7    WILLIAMS ROBIN   F         1
    8    DOWNEY   ROBERT  B         1
    9    PACINO   AL      E         1

The codes in column ['rno_cd'] are defined as:

 A = AI/AK Native
 B = Asian
 C = Black/AA
 D = Hispanic
 E = White
 F = Asian
 G = Asian
 H = Haw/Pac Isl.
 Y = White

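1) I need to decode these codes and put the descriptions in a new column. 2) I also need to account for the blank values somehow.

The end result should look like this: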

         lname    fname   rno_cd    eri_cd  rno_defined
    0    CRUISE   TOM     E         1       White
    1    DEPP     JOHNNY  Y         0       White
    2    DICAPR   LENARDO           1       Unknown
    3    PITT     BRAD              1       Unknown   
    4    MOST     JEFF    A         0       AI/AK Native  
    5    HANKS    TOM               1       Unknown
    6    BRANDO   MARLON  C         1       Black/AA
    7    WILLIAMS ROBIN   F         1       Asian
    8    DOWNEY   ROBERT  B         1       Asian
    9    PACINO   AL      E         1       White

====================== My quick attempt ==================

I used the following, but I'm not sure it is a reliable solution.

In[1]: 
    df1['rno_cd'][df1.rno_cd.str.contains('A')] = 'AI/AK Native'
    df1['rno_cd'][df1.rno_cd.str.contains('B')] = 'Asian'
    df1['rno_cd'][df1.rno_cd.str.contains('C')] = 'Black/AA'
    df1['rno_cd'][df1.rno_cd.str.contains('D')] = 'Hispanic'
    df1['rno_cd'][df1.rno_cd.str.contains('E')] = 'White'
    df1['rno_cd'][df1.rno_cd.str.contains('F')] = 'Asian'
    df1['rno_cd'][df1.rno_cd.str.contains('G')] = 'Asian'
    df1['rno_cd'][df1.rno_cd.str.contains('H')] = 'HawPac'
    df1['rno_cd'][df1.rno_cd.str.contains('Y')] = 'White'


In[1]:  df1
Out[1]:  


         lname      fname      rno_cd   eri_cd
    0    SONJU      LAURIE     White     1
    1    FORTHOFER  KELLY      White     0
    2    PLILEY     JODY                 1
    3    NOEL       HEATHER              1
    4    MANNING    CYNTHIA    White     0
    5    NAUERTZ    ELIZABETH            1
    6    SCHMID     DAVID      White     1
    7    HINTHER    VICTORIA   White     1
    8    JOHNSON    B.         White     1
    9    MOORE      CAROL      White     1
    10   MARSHALL   JOY                  1

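The limitation of this code is that it does not assign a value to the blanks in the original dataset. It also overwrites rno_cd, so I can no longer see the original codes to verify that the values are correct.

Any suggestions or comments?

Thanks.

2 answers:

Answer 0 (score: 2)

A Series (for example, a column of a DataFrame) has a convenient map method. You just need to express the encoding as a dictionary: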

code_to_ethnicity = {'A': 'AI/AK Native',
                     'B': 'Asian'}  # etc.
df['rno_defined'] = df['rno_cd'].map(code_to_ethnicity)
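
As a fuller sketch of the same idea, the snippet below rebuilds a few rows modelled on the question's frame (the sample rows are made up for illustration) and maps every code listed in the question. Note that map with a dictionary returns NaN for any value that is not a key, so blank codes come out as missing rather than as text.

```python
import pandas as pd

# Hypothetical sample: a few rows modelled on the question's data.
df = pd.DataFrame({
    "lname": ["CRUISE", "DEPP", "DICAPR", "MOST"],
    "rno_cd": ["E", "Y", "", "A"],
})

# Full code table as listed in the question.
code_to_ethnicity = {
    "A": "AI/AK Native", "B": "Asian", "C": "Black/AA",
    "D": "Hispanic", "E": "White", "F": "Asian",
    "G": "Asian", "H": "Haw/Pac Isl.", "Y": "White",
}

# Values that are not dictionary keys (here the empty string) map to NaN.
df["rno_defined"] = df["rno_cd"].map(code_to_ethnicity)
print(df)
```

Because the result goes into a new column, the original rno_cd codes stay available for checking.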

By 'blank values' I assume you mean empty strings: ''. If you want to do something special for these, you can add that key directly to the dictionary.

code_to_ethnicity = {'A': 'AI/AK Native',
                     'B': 'Asian',
                     '': 'other'}
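
If the blanks could be empty strings, stray whitespace, or genuinely missing values, one hedged option is to map first and then fill whatever did not match. This is only a sketch: it assumes every unmatched code should be labelled 'Unknown' (the label in the question's desired output), and the frame and abbreviated dictionary are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three kinds of "blank": '', ' ' and NaN.
df = pd.DataFrame({"rno_cd": ["E", "", " ", np.nan, "C"]})

# Abbreviated code table, for the example only.
code_to_ethnicity = {"C": "Black/AA", "E": "White"}

df["rno_defined"] = (
    df["rno_cd"]
    .str.strip()             # tolerate whitespace around codes
    .map(code_to_ethnicity)  # unmatched values become NaN
    .fillna("Unknown")       # label everything that did not match
)
print(df)
```

Filling after the map also catches codes that are simply absent from the dictionary, which may or may not be what you want.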

Answer 1 (score: 1)

You can build a dictionary whose keys are the codes and whose values are the names.

D={"A":"AI/AK Native","B":"Asian","C":"Black/AA","D":"Hispanic","E":"White","F":"Asian","G":"Asian","H":"Haw/Pac Isl","Y":"White"}

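Then go through the rno_cd column and apply a function that transforms the data. You can use apply with a function transform that checks whether x is a key of the dictionary: if it is, return the value D[x]; if it is not, just return "Unknown".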

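The key check described above can be written with dict.get, which returns a default when the key is absent. A minimal sketch, with a hypothetical Series named codes standing in for the rno_cd column:

```python
import pandas as pd

D = {"A": "AI/AK Native", "B": "Asian", "C": "Black/AA", "D": "Hispanic",
     "E": "White", "F": "Asian", "G": "Asian", "H": "Haw/Pac Isl", "Y": "White"}

def transform(x):
    # Return the description if x is a known code, otherwise "Unknown".
    return D.get(x, "Unknown")

# Made-up values for illustration; "Nan" and "Z" are not keys of D.
codes = pd.Series(["E", "Nan", "A", "Z"])
result = codes.apply(transform)
print(result.tolist())  # ['White', 'Unknown', 'AI/AK Native', 'Unknown']
```
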
A complete example:

data="""lname    fname   rno_cd    eri_cd
0    CRUISE   TOM     E         1
1    DEPP     JOHNNY  Y         0
2    DICAPR   LENARDO Nan       1
3    PITT     BRAD    Nan       1
4    MOST     JEFF    A         0
5    HANKS    TOM     Nan       1
6    BRANDO   MARLON  C         1
7    WILLIAMS ROBIN   F         1
8    DOWNEY   ROBERT  B         1
9    PACINO   AL      E         1"""

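import pandas as pd
from io import StringIO

# The header has one field fewer than the rows, so the first field becomes the index.
# The literal "Nan" is not one of pandas' default NA markers, so it is read as a string.
df = pd.read_csv(StringIO(data), sep=r"\s+")

# Code-to-description lookup table.
D={"A":"AI/AK Native","B":"Asian","C":"Black/AA","D":"Hispanic","E":"White","F":"Asian","G":"Asian","H":"Haw/Pac Isl","Y":"White"}

def transform(row):
    # Rows marked "Nan" have no code, so label them "Unknown".
    if row['rno_cd'] == "Nan":
        return "Unknown"
    else:
        return D[row['rno_cd']]

# Apply transform to each row (axis=1).
df["rno_defined"] = df.apply(transform, axis=1)

print(df)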

Another way, building the column directly from the values of rno_cd:

df["rno_defined"] = list(map(lambda x: D[x] if x != "Nan" else "Unknown", df['rno_cd'].values))