I'm trying to use the map function to change the strings in my data to numeric values.
Here is the data:
label sms_message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
I'm trying to change "spam" to 1 and "ham" to 0 with the following:
df['label'] = df.label.map({'ham':0, 'spam':1})
But the result is:
label sms_message
0 NaN Go until jurong point, crazy.. Available only ...
1 NaN Ok lar... Joking wif u oni...
2 NaN Free entry in 2 a wkly comp to win FA Cup fina...
3 NaN U dun say so early hor... U c already then say...
4 NaN Nah I don't think he goes to usf, he lives aro...
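Can anyone spot the problem?
Answer 0 (score: 1)
Your code is correct. I think you executed the same statement twice in a row. The following statements, run in a Python interactive session, illustrate this.
Note: if you pass a dictionary, map() replaces every value in the Series that does not match one of the dictionary's keys with NaN. After the first run the labels are already 0 and 1, which are not keys of the dictionary, so a second run turns all of them into NaN (I think this is what happened in your case). See pandas map() and apply(). The pandas documentation notes: when arg is a dictionary, values in the Series that are not in the dictionary (as keys) are converted to NaN.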
>>> import pandas as pd
>>>
>>> d = {
... "label": ["ham", "ham", "spam", "ham", "ham"],
... "sms_message": [
... "Go until jurong point, crazy.. Available only ...",
... "Ok lar... Joking wif u oni...",
... "Free entry in 2 a wkly comp to win FA Cup fina...",
... "U dun say so early hor... U c already then say...",
... "Nah I don't think he goes to usf, he lives aro..."
... ]
... }
>>>
>>> df = pd.DataFrame(d)
>>> df
label sms_message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
>>>
>>> df['label'] = df.label.map({'ham':0, 'spam':1})
>>> df
label sms_message
0 0 Go until jurong point, crazy.. Available only ...
1 0 Ok lar... Joking wif u oni...
2 1 Free entry in 2 a wkly comp to win FA Cup fina...
3 0 U dun say so early hor... U c already then say...
4 0 Nah I don't think he goes to usf, he lives aro...
>>>
>>> df['label'] = df.label.map({'ham':0, 'spam':1})
>>> df
label sms_message
0 NaN Go until jurong point, crazy.. Available only ...
1 NaN Ok lar... Joking wif u oni...
2 NaN Free entry in 2 a wkly comp to win FA Cup fina...
3 NaN U dun say so early hor... U c already then say...
4 NaN Nah I don't think he goes to usf, he lives aro...
>>>
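The session above shows that re-running the map() line turns every label into NaN. One possible way to make the line safe to run more than once is to map only when every value is still a dictionary key. This is a minimal sketch, not part of the original answer: the helper name encode_labels is hypothetical, and the short "M1".."M5" messages are placeholder data.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "ham"],
    "sms_message": ["M1", "M2", "M3", "M4", "M5"],  # placeholder messages
})

mapping = {"ham": 0, "spam": 1}


def encode_labels(labels, mapping):
    """Hypothetical helper: apply the mapping only if every value is a known key."""
    if labels.isin(list(mapping)).all():
        return labels.map(mapping)
    # Some values are not dictionary keys (for example, already encoded as 0/1):
    # return the Series unchanged instead of letting map() produce NaN.
    return labels


df["label"] = encode_labels(df["label"], mapping)
first_run = df["label"].tolist()

# Running the same line a second time leaves the encoded labels as they are.
df["label"] = encode_labels(df["label"], mapping)
second_run = df["label"].tolist()

print(first_run)
print(second_run)
```

The guard is deliberately all-or-nothing: a column that mixes known and unknown labels is returned untouched, so it is worth inspecting such a column by hand.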
>>> import pandas as pd
>>>
>>> d = {
... "label": ['spam', 'ham', 'ham', 'ham', 'spam'],
... "sms_message": ["M1", "M2", "M3", "M4", "M5"]
... }
>>>
>>> df = pd.DataFrame(d)
>>> df
label sms_message
0 spam M1
1 ham M2
2 ham M3
3 ham M4
4 spam M5
>>>
First way: using map() with a dictionary argument
>>> new_values = {'spam': 1, 'ham': 0}
>>>
>>> df
label sms_message
0 spam M1
1 ham M2
2 ham M3
3 ham M4
4 spam M5
>>>
>>> df.label = df.label.map(new_values)
>>> df
label sms_message
0 1 M1
1 0 M2
2 0 M3
3 0 M4
4 1 M5
>>>
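Second way: using map() with a function argument (run on a freshly created df with the original string labels, not on the one already converted above)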
>>> df.label = df.label.map(lambda v: 0 if v == 'ham' else 1)
>>> df
label sms_message
0 1 M1
1 0 M2
2 0 M3
3 0 M4
4 1 M5
>>>
Third way: using apply() with a function argument (again starting from a freshly created df with the original string labels)
>>> df.label = df.label.apply(lambda v: 0 if v == "ham" else 1)
>>>
>>> df
label sms_message
0 1 M1
1 0 M2
2 0 M3
3 0 M4
4 1 M5
>>>
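The three approaches above can be checked against each other by starting each one from the same untouched labels. The sketch below is an illustration only; the variable names and the made-up label "eggs" are assumptions, not part of the original answer. It also shows one difference between the approaches: the dictionary turns an unknown label into NaN, while the lambda sends anything that is not "ham" to 1.

```python
import pandas as pd

# Same labels as in the session above, kept in a Series that is never overwritten.
labels = pd.Series(["spam", "ham", "ham", "ham", "spam"], name="label")

by_dict = labels.map({"spam": 1, "ham": 0})
by_map_func = labels.map(lambda v: 0 if v == "ham" else 1)
by_apply = labels.apply(lambda v: 0 if v == "ham" else 1)

# All three give the same encoding when every label is "ham" or "spam".
same = by_dict.tolist() == by_map_func.tolist() == by_apply.tolist()
print(same)

# "eggs" is a made-up label that is not a key of the dictionary.
odd = pd.Series(["ham", "eggs"])
dict_result = odd.map({"spam": 1, "ham": 0})             # unknown label becomes NaN
func_result = odd.map(lambda v: 0 if v == "ham" else 1)  # unknown label becomes 1
print(dict_result.isna().tolist())
print(func_result.tolist())
```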
Thanks.
Answer 1 (score: 0)
Maybe your problem is related to the read_table function.
Try:
df = pd.read_table('smsspamcollection/SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])
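To see whether loading is the cause, it may help to look at the distinct label values before mapping them. The sketch below is only an illustration: it uses an in-memory string as a stand-in for the real file, and the three sample rows are assumed, shortened data rather than the actual contents of SMSSpamCollection.

```python
import io

import pandas as pd

# Stand-in for the real file: two tab-separated columns and no header row.
raw = (
    "ham\tGo until jurong point\n"
    "spam\tFree entry in 2 a wkly comp\n"
    "ham\tOk lar\n"
)

df = pd.read_table(io.StringIO(raw),
                   sep="\t",
                   header=None,
                   names=["label", "sms_message"])

# Inspect the labels before mapping: anything other than 'ham' and 'spam'
# here (extra whitespace, a header row read as data, numbers) would end up as NaN.
unique_labels = sorted(df["label"].unique())
print(unique_labels)

df["label"] = df["label"].map({"ham": 0, "spam": 1})
encoded = df["label"].tolist()
print(encoded)
```

If the printed labels are not exactly 'ham' and 'spam', the dictionary keys and the loaded values do not match, and map() will return NaN for those rows.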