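I have a DataFrame with four attributes, as can be seen below.
What I want to do is take each person's name and age and count the number of friends they have. If two people have the same age but different names, the average number of friends for that age should be used. Finally, the age range should be divided into age groups and the average taken per group. This is what I have tried.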
from collections import defaultdict
import datetime

start_time = datetime.datetime.now()  # start of the timing printed at the end

# loc the attribute or features of interest
friends = df.iloc[:,3]
ages = df.iloc[:,2]
# default of dictionary with age as key and value as a list of friends
dictionary_age_friends = defaultdict(list)
# populating the dictionary with key age and values friend
for i,j in zip(ages,friends):
    dictionary_age_friends[i].append(j)
print("first dict")
print(dictionary_age_friends)
# second dictionary, the same age is collected and the number of friends is added
set_dict ={}
for x in dictionary_age_friends:
    list_friends =[]
    for y in dictionary_age_friends[x]:
        list_friends.append(y)
    set_list_len = len(list_friends) # count the entries collected for this age
    set_dict[x] = set_list_len
print(set_dict)
# set_dict ={}
# for x in dictionary_age_friends:
#     print("inside the loop")
#     lis_1 =[]
#     for y in dictionary_age_friends[x]:
#         lis_1.append(y)
#     set_list = lis_1
#     set_list = [1 for x in set_list] # assign a friend with a number 1
#     set_dict[x] = sum(set_list)
# a dictionary that assigns the age range into age groups
second_dict = defaultdict(list)
for i,j in set_dict.items():
    if i in range(16,20):
        i = 'teens_youthAdult'
        second_dict[i].append(j)
    elif i in range(20,40):
        i ="Adult"
        second_dict[i].append(j)
    elif i in range(40,60):
        i ="MiddleAge"
        second_dict[i].append(j)
    elif i in range(60,72):
        i = "old"
        second_dict[i].append(j)
print(second_dict)
print("final dict started")
new_dic ={}
for key,value in second_dict.items():
    if key == 'teens_youthAdult':
        new_dic[key] = round((sum(value)/len(value)),2)
    elif key =='Adult':
        new_dic[key] = round((sum(value)/len(value)),2)
    elif key =='MiddleAge' :
        new_dic[key] = round((sum(value)/len(value)),2)
    else:
        new_dic[key] = round((sum(value)/len(value)),2)
new_dic
end_time = datetime.datetime.now()
print(end_time-start_time)
print(new_dic)
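Some of the feedback I received: 1. If you only want to count friends, there is no need to build a list. 2. Take two individuals of the same age, 18, one with 4 friends and the other with 3. The current code concludes that the average is 7 friends. 3. The code is not correct.
Any suggestions or help? Thanks in advance.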
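Answer 0 (score: 0)
I don't know the attribute names, and the question doesn't say which age groups the data should be split into. In my answer I will treat the data as having these attributes:
index, name, age, friend
To find the count per name, I suggest using groupby.
Input: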
groups = df.groupby([df.iloc[:,0],df.iloc[:,1]]) # grouping by name(0), age(1)
amount_of_friends_df = groups.size() # gathering amount of friends for a person
print(amount_of_friends_df)
Output:
name  age
EUNK  25     1
FBFM  26     1
MYYD  30     1
OBBF  28     2
RJCW  25     1
RQTI  21     1
VLIP  16     1
ZCWQ  18     1
ZMQE  27     1
To find the number of friends by age, you can also use groupby.
Input:
groups = df.groupby([df.iloc[:,1]]) # groups by age(1)
age_friends = groups.size()
age_friends=age_friends.reset_index()
age_friends.columns=(['age','amount_of_friends'])
print(age_friends)
Output:
   age  amount_of_friends
0   16                  1
1   18                  1
2   21                  1
3   25                  2
4   26                  1
5   27                  1
6   28                  2
7   30                  1
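One caveat about the counts above: size() counts rows per group, not the values stored in the friend column. If that column actually holds a numeric friend count per person (an assumption; the question does not show the data), the three aggregations differ, which is what the "4 and 3 friends at age 18" feedback in the question is about. A minimal sketch with made-up names, values, and column labels:

```python
import pandas as pd

# Hypothetical sample data: the column names and values are assumptions,
# chosen to mirror the "two 18-year-olds with 4 and 3 friends" example.
df = pd.DataFrame({
    "name": ["ANNA", "BORIS", "CLARA"],
    "age": [18, 18, 25],
    "friends": [4, 3, 10],
})

by_age = df.groupby("age")["friends"]

rows_per_age = by_age.size()    # how many people share each age
total_per_age = by_age.sum()    # friend counts added together
mean_per_age = by_age.mean()    # average friend count per age

print(rows_per_age.loc[18], total_per_age.loc[18], mean_per_age.loc[18])
```

For age 18 this gives 2 rows, a sum of 7, and a mean of 3.5, so mean() is the aggregation that matches "average number of friends for that age".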
To calculate the average number of friends for each age group, you can use categories together with groupby.
Input:
mean_by_age_group_df = age_friends.groupby(pd.cut(age_friends.age,[20,40,60,72]))\
.agg({'amount_of_friends':'mean'})
print(mean_by_age_group_df)
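pd.cut returns a categorical Series, which we use to group the data. We then use the agg function to aggregate each group in the DataFrame.
Output:
          amount_of_friends
age
(20, 40]           1.333333
(40, 60]                NaN
(60, 72]                NaN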
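Note that the bins [20, 40, 60, 72] leave out ages below 20, and pd.cut intervals are closed on the right by default, so an age of exactly 20 would not land in (20, 40]. To mirror the ranges used in the question (range(16,20), range(20,40), and so on), one option is to add the lower edge 16 and pass right=False. The sketch below assumes a numeric friends column and uses invented sample data; the group labels are taken from the question's code.

```python
import pandas as pd

# Hypothetical sample data: column names and values are assumptions.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "age": [18, 18, 25, 30, 45, 65],
    "friends": [4, 3, 10, 20, 8, 2],
})

# Step 1: average friend count for each distinct age.
per_age = df.groupby("age", as_index=False)["friends"].mean()

# Step 2: map each age to a group. right=False makes the intervals
# [16, 20), [20, 40), [40, 60), [60, 72), like range() in the question.
bins = [16, 20, 40, 60, 72]
labels = ["teens_youthAdult", "Adult", "MiddleAge", "old"]
per_age["age_group"] = pd.cut(per_age["age"], bins=bins, labels=labels, right=False)

# Step 3: average the per-age means within each group.
mean_per_group = (
    per_age.groupby("age_group", observed=True)["friends"].mean().round(2)
)
print(mean_per_group)
```

Averaging the per-age means (rather than all rows directly) follows the order of steps described in the question; if every person should weigh equally within a group, grouping the original rows by age group would be the alternative.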