Recoding categorical labels in Python Pandas

Date: 2018-07-19 14:02:04

Tags: python pandas

I am struggling to recode some categorical labels. Here is my minimal example.

import pandas as pd
testDict = {'Col1' : pd.Categorical(["a", "b", "c", "d", "e"]),
          'Col2' : pd.Categorical(["1", "2", "3", "4", "5"])}

testDF = pd.DataFrame.from_dict(testDict)
testDF
testDF['Col1'].value_counts()
def letter_recode(Col1):
    if(Col1=="a")|(Col1=="b"):
        return "ab"
    elif (Col1=="c")|(Col1=="d"):
        return "cd"
    else:
        return Col1

testDF['Col3'] = testDF['Col1'].apply(letter_recode)

testDF['Col3'].value_counts()
testDF

I want to change this df:

   Col1 Col2
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

into this:

  Col1 Col2 Col3
0   a   1   ab
1   b   2   ab
2   c   3   cd
3   d   4   cd
4   e   5   e

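The approach above works, but when I try this code on my actual dataframe, nothing changes. Also, when I take a small slice of the dataframe and run the code on it, I get the following warning, and I don't understand the documentation it points to.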

df5 = df.loc[0:4,:]
df5
    age workclass   fnlwgt  education   education-num   marital-status  occupation  relationship    race    sex capital-gain    capital-loss    hours-per-week  native-country  salary  workclassR
0   50  Self-emp-not-inc    83311   Bachelors   13  Married-civ-spouse  Exec-managerial Husband White   Male    0   0   13  United-States   <=50K   Self-emp-not-inc
1   38  Private 215646  HS-grad 9   Divorced    Handlers-cleaners   Not-in-family   White   Male    0   0   40  United-States   <=50K   Private
2   53  Private 234721  11th    7   Married-civ-spouse  Handlers-cleaners   Husband Black   Male    0   0   40  United-States   <=50K   Private
3   28  Private 338409  Bachelors   13  Married-civ-spouse  Prof-specialty  Wife    Black   Female  0   0   40  Cuba    <=50K   Private
4   37  Private 284582  Masters 14  Married-civ-spouse  Exec-managerial Wife    White   Female  0   0   40  United-States   <=50K   Private

def rename_workclass(wc):
    if(wc=="Never-worked")|(wc=="Without-pay"):
        return "Unemployed"
    elif (wc=="State-gov")|(wc=="Local-gov"):
        return "Gov"
    elif (wc=="Self-emp-inc")|(wc=="Self-emp-not-inc"):
        return "Self-emp"
    else:
        return wc


df5['workclassR'] = df5['workclass'].apply(rename_workclass)
  

C:\Users\karol\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':

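Thank you very much for the help. My problem was that the values had a leading space, and I was comparing them against strings without the space. The warning above can also be eliminated by declaring that the sliced dataset is not a copy: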

df5 = df.iloc[0:4, :]  # first four rows by position (the end of an iloc slice is exclusive)
df5.is_copy = False    # note: the is_copy attribute is deprecated since pandas 0.23
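
A minimal sketch of both fixes, assuming the column holds plain strings with a leading space. The small frame below is made-up stand-in data, not the real dataset. `.str.strip()` removes the surrounding whitespace so the comparisons can match, and taking an explicit `.copy()` of the slice is a common alternative to setting `is_copy`, since the new column is then assigned to an independent frame.

```python
import pandas as pd

# Hypothetical stand-in for the real data: note the leading spaces.
df = pd.DataFrame({
    "age": [50, 38, 53, 28, 37],
    "workclass": [" Self-emp-not-inc", " Private", " State-gov",
                  " Without-pay", " Private"],
})

def rename_workclass(wc):
    # Same recoding as in the question, written with `in` for brevity.
    if wc in ("Never-worked", "Without-pay"):
        return "Unemployed"
    elif wc in ("State-gov", "Local-gov"):
        return "Gov"
    elif wc in ("Self-emp-inc", "Self-emp-not-inc"):
        return "Self-emp"
    return wc

# Strip surrounding whitespace so the string comparisons can match.
df["workclass"] = df["workclass"].str.strip()

# Take an explicit copy of the slice before adding a column to it.
df5 = df.loc[0:4, :].copy()
df5["workclassR"] = df5["workclass"].apply(rename_workclass)
print(df5["workclassR"].tolist())
```

Because `df5` is a copy, the original `df` does not gain the `workclassR` column.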

2 Answers:

Answer 0 (score: 1)

Try using pd.Series.map(). A toy example:

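import pandas as pd

s = pd.Series(["Private", "Public", "?"])  # sample input matching the output below
s = s.map({"Private": "Private-changed",
           "Public": "Public_changed",
           "?": "What is this"})
s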

This gives you:

0    Private-changed
1     Public_changed
2       What is this
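
One thing to keep in mind with a dictionary passed to `map`: values that have no key in the dictionary come back as NaN rather than being left unchanged. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["Private", "Public", "Self-emp"])  # made-up values

mapped = s.map({"Private": "Private-changed", "Public": "Public_changed"})

# "Self-emp" has no key in the dictionary, so it becomes NaN.
print(mapped.isna().tolist())
```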

Answer 1 (score: 1)

You can use pd.Series.map with a dictionary, then fillna with the original series:

import pandas as pd

df = pd.DataFrame({'Col1' : pd.Categorical(["a", "b", "c", "d", "e"]),
                   'Col2' : pd.Categorical(["1", "2", "3", "4", "5"])})

mapper = {'a': 'ab', 'b': 'ab', 'c': 'cd', 'd': 'cd'}

df['Col3'] = df['Col1'].map(mapper).fillna(df['Col1'])

print(df['Col3'].value_counts())

cd    2
ab    2
e     1
Name: Col3, dtype: int64
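
The same pattern could be applied to the workclass labels from the question. This is only a sketch: the sample series is hypothetical, it assumes the labels have already been stripped of whitespace, and the `groups` dictionary is an illustrative way of writing the mapping, inverted into an old-label-to-new-label dictionary before calling `map`.

```python
import pandas as pd

# Hypothetical sample of already-stripped workclass labels.
wc = pd.Series(["Self-emp-not-inc", "Private", "State-gov",
                "Without-pay", "Local-gov"])

# New label -> old labels, following the question's rename_workclass function.
groups = {
    "Unemployed": ["Never-worked", "Without-pay"],
    "Gov": ["State-gov", "Local-gov"],
    "Self-emp": ["Self-emp-inc", "Self-emp-not-inc"],
}

# Invert into old label -> new label for use with map.
mapper = {old: new for new, olds in groups.items() for old in olds}

# Labels missing from the mapper become NaN, then fall back to the original value.
recoded = wc.map(mapper).fillna(wc)
print(recoded.tolist())
```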