I'm working with an airbnb dataset on Kaggle:
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
and want to simplify the values for the language column into 2 groupings - english and non-english.
For instance:
users.language.value_counts()
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
Name: language, dtype: int64
And the result I want it is:
users.language.value_counts()
english 15011
non-english 459
Name: language, dtype: int64
This is sort of the solution I want:
def language_groupings():
for i in users:
if users.language !='en':
replace(users.language.str, 'non-english')
else:
replace(users.language.str, 'english')
return users
users['language'] = users.apply(lambda row: language_groupings)
Except there's obviously something wrong with this as it returns an empty series when I run value_counts on the column.
答案 0 :(得分:2)
is that what you want?
In [181]: x
Out[181]:
val
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
In [182]: x.groupby(x.index == 'en').sum()
Out[182]:
val
False 459
True 15011
答案 1 :(得分:2)
Try this:
users.language = np.where( users.language !='en', 'non-english', 'english' )