I am struggling to recode some categorical labels. Here is my minimal example.
import pandas as pd
testDict = {'Col1' : pd.Categorical(["a", "b", "c", "d", "e"]),
'Col2' : pd.Categorical(["1", "2", "3", "4", "5"])}
testDF = pd.DataFrame.from_dict(testDict)
testDF
testDF['Col1'].value_counts()
def letter_recode(Col1):
    if (Col1 == "a") | (Col1 == "b"):
        return "ab"
    elif (Col1 == "c") | (Col1 == "d"):
        return "cd"
    else:
        return Col1
testDF['Col3'] = testDF['Col1'].apply(letter_recode)
testDF['Col3'].value_counts()
testDF
I want to change this df:
Col1 Col2
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
into this:
Col1 Col2 Col3
0 a 1 ab
1 b 2 ab
2 c 3 cd
3 d 4 cd
4 e 5 e
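The code above works, but when I try it on my actual DataFrame, nothing changes. Also, when I take a small slice of the DataFrame and run the code on it, I get the following warning, and I do not understand the documentation it points to.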
df5 = df.loc[0:4,:]
df5
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary workclassR
0 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K Self-emp-not-inc
1 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K Private
2 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K Private
3 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K Private
4 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K Private
def rename_workclass(wc):
    if (wc == "Never-worked") | (wc == "Without-pay"):
        return "Unemployed"
    elif (wc == "State-gov") | (wc == "Local-gov"):
        return "Gov"
    elif (wc == "Self-emp-inc") | (wc == "Self-emp-not-inc"):
        return "Self-emp"
    else:
        return wc
df5['workclassR'] = df5['workclass'].apply(rename_workclass)
C:\Users\karol\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if sys.path[0] == '':
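Thank you very much for the help. My problem was that the values had a leading space, and I was comparing them against strings without one. Also, the warning above can be silenced by declaring that the sliced dataset is not a copy: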
df5 = df.iloc[0:4, :]  # select the first four rows by position
df5.is_copy = False
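A minimal sketch of the two fixes described above, under a few assumptions: the small frame and its values are made up for illustration, the raw labels carry a leading space, and `.copy()` is swapped in for `is_copy = False` because, as far as I know, the `is_copy` attribute was deprecated and later removed in newer pandas releases. Stripping the whitespace before comparing lets the recoding match, and an explicit copy of the slice avoids the SettingWithCopyWarning:

```python
import pandas as pd

# Hypothetical sample: labels with a leading space, as in the raw data
df = pd.DataFrame({"workclass": [" Self-emp-not-inc", " Private", " State-gov",
                                 " Never-worked", " Local-gov"]})

def rename_workclass(wc):
    if wc in ("Never-worked", "Without-pay"):
        return "Unemployed"
    elif wc in ("State-gov", "Local-gov"):
        return "Gov"
    elif wc in ("Self-emp-inc", "Self-emp-not-inc"):
        return "Self-emp"
    return wc

# Explicit copy: df5 is an independent frame, so adding a column to it is safe
df5 = df.iloc[0:4, :].copy()

# Strip surrounding whitespace first, then recode
df5["workclassR"] = df5["workclass"].str.strip().apply(rename_workclass)
print(df5["workclassR"].tolist())
```

Without the `.str.strip()` step, none of the comparisons would match and every label would be returned unchanged, which fits the "nothing changes" symptom.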
Answer 0 (score: 1)
Try using pd.Series.map(). A toy example:
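import pandas as pd
# Input series assumed from the output shown below
s = pd.Series(["Private", "Public", "?"])
s = s.map({"Private": "Private-changed",
           "Public": "Public_changed",
           "?": "What is this"})
s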
This gives you:
0 Private-changed
1 Public_changed
2 What is this
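One caveat worth sketching: when a dictionary is passed to `map`, values with no matching key come back as NaN rather than being left unchanged. The series below is a made-up example, not data from the question:

```python
import pandas as pd

# "Other" is deliberately missing from the mapping dictionary
s = pd.Series(["Private", "Public", "Other"])
mapped = s.map({"Private": "Private-changed", "Public": "Public_changed"})

# The unmapped entry is NaN after map
print(mapped.isna().tolist())
```

So if only some labels should be recoded, the unmapped ones need to be filled back in afterwards.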
Answer 1 (score: 1)
You can use pd.Series.map with a dictionary, then fillna with the original series:
import pandas as pd
df = pd.DataFrame({'Col1' : pd.Categorical(["a", "b", "c", "d", "e"]),
'Col2' : pd.Categorical(["1", "2", "3", "4", "5"])})
mapper = {'a': 'ab', 'b': 'ab', 'c': 'cd', 'd': 'cd'}
df['Col3'] = df['Col1'].map(mapper).fillna(df['Col1'])
print(df['Col3'].value_counts())
cd 2
ab 2
e 1
Name: Col3, dtype: int64
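The same pattern could be applied to the workclass labels from the question. This is only a sketch: the sample series is hypothetical, it assumes plain string labels with no leading spaces, and the dictionary simply mirrors the branches of the question's rename_workclass function:

```python
import pandas as pd

# Hypothetical sample of workclass labels (plain strings, no leading spaces)
wc = pd.Series(["Self-emp-not-inc", "Private", "State-gov", "Without-pay"])

mapper = {
    "Never-worked": "Unemployed", "Without-pay": "Unemployed",
    "State-gov": "Gov", "Local-gov": "Gov",
    "Self-emp-inc": "Self-emp", "Self-emp-not-inc": "Self-emp",
}

# Unmapped labels (e.g. "Private") become NaN after map and are restored by fillna
recoded = wc.map(mapper).fillna(wc)
print(recoded.tolist())
```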