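I have the columns shown below and am trying to return the most frequent value of the rating column for each gender, but my code just gives me the most frequent ratings without taking gender into account.
Data: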
print(df)
   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13
Code:
s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a', 'b', 'c'))
       .reset_index(level=2)['c'])
Output:
print(s['F'])
('PG')
print(s['M'])
('PG', 'R')
Answer 0 (score: 2)
Here is a standard-library solution for this file:
%%file "test.txt"
gender rating
M PG
M R
F NR
M R
F PG13
F PG13
Given
import collections as ct

def read_file(fname):
    """Yield (gender, rating) pairs from a whitespace-delimited file."""
    with open(fname, "r") as f:
        next(f)  # skip the header line
        for line in f:
            gender, rating = line.strip().split()
            yield gender, rating
Code
filename = "test.txt"
dd = ct.defaultdict(ct.Counter)
for k, v in sorted(read_file(filename), key=lambda x: x[0]):
    dd[k][v] += 1
{k: v.most_common(1) for k, v in dd.items()}
# {'F': [('PG13', 2)], 'M': [('R', 2)]}
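Details
Each line of the file is parsed and added to a defaultdict. The keys are genders, and each value is a Counter that tallies the ratings for that gender. Counter.most_common() is then called to retrieve the most frequent rating.
Because the data is grouped by gender, you can explore it further. For example, the unique ratings for each gender: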
{k: set(v.elements()) for k, v in dd.items()}
# {'F': {'NR', 'PG13'}, 'M': {'PG', 'R'}}
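The question's frame is keyed by both AGE and GENDER, and the same defaultdict-of-Counter pattern seems to carry over if a tuple is used as the key. A minimal sketch, assuming the rows are already in memory: the `rows` list below is typed in from the question's data rather than read from test.txt, and `by_group` and `top` are names made up for illustration.

```python
import collections as ct

# Rows copied from the question's DataFrame as (age, gender, rating); illustrative only.
rows = [
    (10, "M", "PG"),
    (10, "M", "R"),
    (10, "M", "R"),
    (4, "F", "PG13"),
    (4, "F", "PG13"),
]

# One Counter of ratings per (age, gender) pair.
by_group = ct.defaultdict(ct.Counter)
for age, gender, rating in rows:
    by_group[(age, gender)][rating] += 1

# most_common(1) returns a one-item list of (rating, count); keep only the rating.
top = {key: counts.most_common(1)[0][0] for key, counts in by_group.items()}
print(top)
# {(10, 'M'): 'R', (4, 'F'): 'PG13'}
```

Keying on the tuple keeps each age and gender combination separate, so a rating that is common overall cannot leak into a group where it never occurs.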
Answer 1 (score: 1)
I think you need groupby + value_counts + head to count the ratings within each category:
df1 = (df.groupby('gender')['rating']
         .apply(lambda x: x.value_counts().head(1))
         .rename_axis(('gender', 'rating'))
         .reset_index(name='val'))
print (df1)
  gender rating  val
0      F   PG13    2
1      M      R    2
If you only want the top rating, select the first value of each group's index:
s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])
print (s)
gender
F PG13
M R
Name: rating, dtype: object
print (s['M'])
R
print (s['F'])
PG13
Or, if you only want the highest count, select the first value of each group's Series:
s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().iat[0])
print (s)
gender
F 2
M 2
Name: rating, dtype: int64
print (s['M'])
2
print (s['F'])
2
EDIT:
s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])
def gen_mpaa(gender):
    return s[gender]
print (gen_mpaa('M'))
print (gen_mpaa('F'))
EDIT:
A solution if the genre id values are strings:
print (type(df.loc[0, 'genre id']))
<class 'str'>
df = df.set_index('gender')['genre id'].str.split(',', expand=True).stack()
print (df)
gender
M 0 11
1 22
2 33
0 22
1 44
2 55
0 33
1 44
2 55
F 0 11
1 22
0 22
1 55
0 55
1 44
dtype: object
d = df.groupby(level=0).apply(lambda x: x.value_counts().index[0]).to_dict()
print (d)
{'M': '55', 'F': '55'}
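One thing worth checking in the stacked output above is that several genre ids appear to share the highest count within each gender, so which one ends up first is a matter of tie ordering. A rough standard-library sketch that lists every tied value instead of just one; the `rows` list is reconstructed from the stacked output and is an assumption about the original frame, and `tied_for_top` is a hypothetical helper, not part of pandas.

```python
from collections import Counter, defaultdict

# Rows reconstructed from the stacked output (an assumption about the original data).
rows = [
    ("M", "11,22,33"),
    ("M", "22,44,55"),
    ("M", "33,44,55"),
    ("F", "11,22"),
    ("F", "22,55"),
    ("F", "55,44"),
]

# Split each comma-separated string and tally the ids per gender.
counts = defaultdict(Counter)
for gender, genre_ids in rows:
    counts[gender].update(genre_ids.split(","))

def tied_for_top(counter):
    """Return every key that shares the highest count, in sorted order."""
    best = max(counter.values())
    return sorted(key for key, n in counter.items() if n == best)

tied = {gender: tied_for_top(c) for gender, c in counts.items()}
print(tied)
# {'M': ['22', '33', '44', '55'], 'F': ['22', '55']}
```

If a single value is required, it is probably safer to pick an explicit tie-break rule (for example the smallest id) than to rely on whatever order the counting routine happens to return.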
EDIT1:
print (df)
   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13
s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a', 'b', 'c'))
       .reset_index(level=2)['c'])
print (s)
a   b
4   F    PG13
10  M       R
    M      PG
Name: c, dtype: object
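Because the result here is indexed by two levels, a plain lookup by gender alone will not behave like it does in the single-key case. One possible way to look up by gender is a cross-section on that level. A minimal sketch, assuming only the single most frequent rating per (AGE, GENDER) group is wanted; `top` is a name made up for illustration, and the frame is rebuilt from the printed data.

```python
import pandas as pd

# Frame rebuilt from the printed data above.
df = pd.DataFrame({
    "AGE": [10, 10, 10, 4, 4],
    "GENDER": ["M", "M", "M", "F", "F"],
    "rating": ["PG", "R", "R", "PG13", "PG13"],
})

# Most frequent rating per (AGE, GENDER) group; value_counts sorts by
# descending count, so index[0] is the top entry.
top = df.groupby(["AGE", "GENDER"])["rating"].agg(lambda x: x.value_counts().index[0])

# Cross-section on the GENDER level; the result is indexed by AGE.
print(top.xs("M", level="GENDER").to_dict())
print(top.xs("F", level="GENDER").to_dict())
```

With this data the M lookup should come back with R for age 10 and the F lookup with PG13 for age 4, since each gender only occurs at one age.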