I have a DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([['M', 2014, 'Seth', 5],
                   ['M', 2014, 'Spencer', 5],
                   ['M', 2014, 'Tyce', 5],
                   ['F', 2014, 'Seth', 25],
                   ['F', 2014, 'Spencer', 23]],
                  columns=['sex', 'year', 'name', 'number'])
print(df)
I want to find the most gender-ambiguous name in 2014. I have tried many approaches, but have had no luck so far.
Answer 0: (score: 0)
I'm not sure what "most gender ambiguous" means, but you can start from this:
>>> dfy = (df.year == 2014)
>>> dfF = df[(df.sex == 'F') & dfy][['name', 'number']]
>>> dfM = df[(df.sex == 'M') & dfy][['name', 'number']]
>>> pd.merge(dfF, dfM, on=['name'])
      name  number_x  number_y
0     Seth        25         5
1  Spencer        23         5
If you only want the name with the largest total, then:
>>> dfT = pd.merge(dfF, dfM, on=['name'])
>>> dfT
      name  number_x  number_y
0     Seth        25         5
1  Spencer        23         5
>>> dfT['total'] = dfT['number_x'] + dfT['number_y']
>>> dfT.sort_values('total', ascending=False).head(1)
   name  number_x  number_y  total
0  Seth        25         5     30
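Since "most gender ambiguous" is open to interpretation, here is a minimal sketch of one other possible reading: the name whose counts are split most evenly between the two sexes. The `balance` score (share of the less common sex) is an assumption made for illustration, not something defined in the question, and the sketch assumes both 'F' and 'M' appear in the selected year.

```python
import pandas as pd

df = pd.DataFrame([['M', 2014, 'Seth', 5],
                   ['M', 2014, 'Spencer', 5],
                   ['M', 2014, 'Tyce', 5],
                   ['F', 2014, 'Seth', 25],
                   ['F', 2014, 'Spencer', 23]],
                  columns=['sex', 'year', 'name', 'number'])

# One row per name, one column per sex; a name missing for a sex gets 0.
counts = df[df.year == 2014].pivot_table(
    index='name', columns='sex', values='number',
    aggfunc='sum', fill_value=0)

# Hypothetical score: share of the less common sex,
# from 0 (only one sex) up to 0.5 (perfectly even split).
both = counts[['F', 'M']]
counts['balance'] = both.min(axis=1) / both.sum(axis=1)

# Name with the most even split (first one if several tie).
most_balanced = counts['balance'].idxmax()
print(most_balanced)
```

Under this reading the result differs from the "largest total" approach, because a name with a smaller total can still have a more even split.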
Answer 1: (score: 0)
First, you need to get the list of gender-ambiguous names. I suggest using a set intersection:
>>> male_names = df[df.sex == "M"].name
>>> female_names = df[df.sex == "F"].name
>>> gender_ambiguous_names = list(set(male_names).intersection(set(female_names)))
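Note that the intersection above looks across every year in the data, so a name that is male in one year and female in another would also count. If that is not what you want, a possible per-year variant is sketched below. The extra 2013 row is hypothetical and only there to show the difference.

```python
import pandas as pd

df = pd.DataFrame([['M', 2014, 'Seth', 5],
                   ['M', 2014, 'Spencer', 5],
                   ['M', 2014, 'Tyce', 5],
                   ['F', 2014, 'Seth', 25],
                   ['F', 2014, 'Spencer', 23],
                   ['F', 2013, 'Tyce', 2]],  # hypothetical row, not in the question
                  columns=['sex', 'year', 'name', 'number'])

# Intersection over all years: Tyce qualifies (M in 2014, F in 2013).
all_years = set(df[df.sex == 'M'].name) & set(df[df.sex == 'F'].name)

# Count distinct sexes per (year, name) and keep the pairs that have both.
sexes_per_name = df.groupby(['year', 'name'])['sex'].nunique()
both_sexes = sexes_per_name[sexes_per_name == 2]

# Names that are gender ambiguous within 2014 itself.
ga_names_2014 = sorted(both_sexes.loc[2014].index)
print(ga_names_2014)
```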
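Now you want to subset the data so that it only contains the gender-ambiguous names for 2014. You can use a membership test and chain the boolean conditions on one line: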
>>> gender_ambiguous_data_2014 = df[(df.name.isin(gender_ambiguous_names)) & (df.year == 2014)]
Now you have gender_ambiguous_data_2014:
>>> gender_ambiguous_data_2014
  sex  year     name  number
0   M  2014     Seth       5
1   M  2014  Spencer       5
3   F  2014     Seth      25
4   F  2014  Spencer      23
Then you just group by name and sum the numbers:
>>> gender_ambiguous_data_2014.groupby('name').number.sum()
name
Seth       30
Spencer    28
Name: number, dtype: int64
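The last thing you want is the name with the highest total. In practice, though, several gender-ambiguous names may share the same total. So let's assign the previous result to a new variable, gender_ambiguous_numbers_2014, and work with that: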
>>> gender_ambiguous_numbers_2014 = gender_ambiguous_data_2014.groupby('name').number.sum()
>>> # get the max and find the list of names:
>>> gender_ambiguous_max_2014 = gender_ambiguous_numbers_2014[gender_ambiguous_numbers_2014 == gender_ambiguous_numbers_2014.max()]
Now you get:
>>> gender_ambiguous_max_2014
name
Seth    30
Name: number, dtype: int64
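To see why the comparison against `.max()` is used here instead of a shortcut such as `idxmax()`, here is a small sketch with a tie. The `totals` Series is made-up data for illustration only.

```python
import pandas as pd

# Hypothetical totals containing a tie, for illustration only.
totals = pd.Series({'Seth': 30, 'Spencer': 30, 'Tyce': 5}, name='number')

# idxmax() returns only the first label that reaches the maximum.
first_only = totals.idxmax()

# A boolean mask against the maximum keeps every tied name.
all_tied = totals[totals == totals.max()].index.tolist()

print(first_only)
print(all_tied)
```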
Cool, let's extract the index names!
>>> gender_ambiguous_max_2014.index
Index([u'Seth'], dtype='object')
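Wait, what is this type exactly? (Hint: it's pandas.core.index.Index; recent pandas versions expose the same class as pandas.Index.)
No problem, just coerce it to a list: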
>>> list(gender_ambiguous_max_2014.index)
['Seth']
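So in this case our list has only one element. But maybe we want to write a function that returns a string when there is a single contender, or a list of strings when several gender-ambiguous names share the same total in that year.
In the wrapper function below, I abbreviated gender_ambiguous to ga in my variable names to keep the code short. Of course, this assumes the dataset has the same format as the one you showed and is named df. If it has a different name, just change df accordingly.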
def get_most_popular_gender_ambiguous_name(year):
    """Get the gender ambiguous name with the most numbers in a certain year.

    Returns:
        a string, or a list of strings

    Note:
        'gender_ambiguous' will be abbreviated as 'ga'
    """
    # get the gender ambiguous names
    male_names = df[df.sex == "M"].name
    female_names = df[df.sex == "F"].name
    ga_names = list(set(male_names).intersection(set(female_names)))
    # filter by year
    ga_data = df[(df.name.isin(ga_names)) & (df.year == year)]
    # aggregate to get total numbers
    ga_total_numbers = ga_data.groupby('name').number.sum()
    # find the max number
    ga_max_number = ga_total_numbers.max()
    # subset the Series to only those that have max numbers
    ga_max_data = ga_total_numbers[
        ga_total_numbers == ga_max_number
    ]
    # get the index (the names) for those satisfying the conditions
    most_popular_ga_names = list(ga_max_data.index)  # list coercion
    # if list only contains one element, return the only element
    if len(most_popular_ga_names) == 1:
        return most_popular_ga_names[0]
    return most_popular_ga_names
Now, calling this function is very simple:
>>> get_most_popular_gender_ambiguous_name(2014) # assuming df is dataframe var name
'Seth'
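As a possible variation, here is a sketch that takes the DataFrame as an argument instead of relying on a global df, and always returns a list so callers do not have to handle two return types. The function name `most_popular_ga_names` and the parameter `data` are hypothetical, and unlike the function above it only treats a name as gender ambiguous if both sexes appear in the requested year.

```python
import pandas as pd

def most_popular_ga_names(data, year):
    """Return a list of the gender-ambiguous names with the highest total in `year`."""
    in_year = data[data.year == year]
    if in_year.empty:
        return []
    # Per name: how many distinct sexes it has, and its total count.
    per_name = in_year.groupby('name').agg(
        sexes=('sex', 'nunique'),
        total=('number', 'sum'))
    # Keep only names recorded for both sexes in this year.
    ga = per_name[per_name.sexes == 2]
    if ga.empty:
        return []
    # Keep every name tied for the maximum total.
    return ga[ga.total == ga.total.max()].index.tolist()

df = pd.DataFrame([['M', 2014, 'Seth', 5],
                   ['M', 2014, 'Spencer', 5],
                   ['M', 2014, 'Tyce', 5],
                   ['F', 2014, 'Seth', 25],
                   ['F', 2014, 'Spencer', 23]],
                  columns=['sex', 'year', 'name', 'number'])

print(most_popular_ga_names(df, 2014))
```

Returning an empty list for a year with no data (or no ambiguous names) avoids a special case for the caller, at the cost of having to index `[0]` when a single name is expected.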