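I am trying to get the top rating using a groupby on multiple columns. If a particular combination does not exist in the groupby, I get an error. How can I handle multiple combinations?
Data: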
maritalstatus gender age_range occ rating
ma M young student PG
ma F adult teacher R
sin M young student PG
sin M adult teacher R
ma M young student PG
sin F adult teacher R
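Code:
def get_top(maritalstatus, gender, age_range, occ):
    # Most frequent rating for each combination of the four columns
    m = df.groupby(['maritalstatus', 'gender', 'age_range', 'occ'])['rating'].apply(
        lambda x: x.value_counts().index[0])
    # Fails (KeyError) when the requested combination is not in the data
    mpaa = m[maritalstatus][gender][age_range][occ]
    return mpaa
Input:
get_top('ma', 'M', 'young', 'teacher')
Output: an error is raised because there is no such combination.
If there is no such combination, my function should fall back to married, male and young, ignoring teacher, because that combination does not exist.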
Answer 0 (score: 0)
Here is a non-pandas solution. Counter.most_common() returns the counts in descending order of frequency.
from collections import Counter

def get_top(maritalstatus=None, gender=None, age_range=None, occ=None):
    cols = ['maritalstatus', 'gender', 'age_range', 'occ']
    values = [maritalstatus, gender, age_range, occ]
    # Build the query string only from the arguments that were supplied
    q = ' & '.join('({0} == "{1}")'.format(i, j)
                   for i, j in zip(cols, values) if j)
    c = Counter(df.query(q)['rating'])
    return c.most_common()

get_top(maritalstatus='ma', gender='M', age_range='young')  # [('PG', 2)]
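The answer above assumes a DataFrame named df already exists. As a minimal sketch for trying the idea out, the block below builds df from the sample rows in the question and counts ratings with Counter, filtering with boolean masks instead of a query string. The function name top_ratings and the keyword-filter interface are illustrative assumptions, not part of the original answer.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [
        ("ma", "M", "young", "student", "PG"),
        ("ma", "F", "adult", "teacher", "R"),
        ("sin", "M", "young", "student", "PG"),
        ("sin", "M", "adult", "teacher", "R"),
        ("ma", "M", "young", "student", "PG"),
        ("sin", "F", "adult", "teacher", "R"),
    ],
    columns=["maritalstatus", "gender", "age_range", "occ", "rating"],
)


def top_ratings(frame, **filters):
    """Count ratings for rows that match every supplied column=value filter.

    `top_ratings` is a hypothetical helper; filters set to None are skipped.
    """
    subset = frame
    for column, value in filters.items():
        if value is not None:
            subset = subset[subset[column] == value]
    # most_common() lists (rating, count) pairs, highest count first
    return Counter(subset["rating"]).most_common()


print(top_ratings(df, maritalstatus="ma", gender="M", age_range="young"))
```

Because no query string is built, calling the helper with no filters simply counts ratings over the whole table, and a filter that matches nothing yields an empty list rather than an error.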
Answer 1 (score: 0)
You can use *args for dynamic input (the order of the values cannot be changed) and filter with query:
def get_top(*args):
    c = ['maritalstatus', 'gender', 'age_range', 'occ']
    m = (df.groupby(c)['rating'].apply(lambda x: x.value_counts().index[0])
           .reset_index())
    args = list(args)
    while True:
        d = dict(zip(c, args))
        # https://stackoverflow.com/a/48371587/2901002
        q = ' & '.join('({} == "{}")'.format(i, j) for i, j in d.items())
        m1 = m.query(q)['rating']
        # Drop the last argument and retry while nothing matches
        if m1.empty and len(args) > 1:
            args.pop()
        else:
            return m1

print(get_top('ma', 'M', 'young', 'teacher'))
1    PG
Name: rating, dtype: object
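The fallback idea above (drop the most specific argument until something matches) can also be sketched without building a query string, which sidesteps quoting issues in the values. This is only an illustration under stated assumptions: the name top_rating_with_fallback, the COLUMNS constant, and the choice to return the filters that were actually used are not from the original answer.

```python
import pandas as pd

COLUMNS = ["maritalstatus", "gender", "age_range", "occ"]

# Sample data copied from the question.
df = pd.DataFrame(
    [
        ("ma", "M", "young", "student", "PG"),
        ("ma", "F", "adult", "teacher", "R"),
        ("sin", "M", "young", "student", "PG"),
        ("sin", "M", "adult", "teacher", "R"),
        ("ma", "M", "young", "student", "PG"),
        ("sin", "F", "adult", "teacher", "R"),
    ],
    columns=COLUMNS + ["rating"],
)


def top_rating_with_fallback(frame, *values):
    """Return (most common rating, filters used).

    Trailing values are dropped one at a time until at least one row matches.
    Hypothetical helper; values must follow the order of COLUMNS.
    """
    values = list(values)
    while values:
        mask = pd.Series(True, index=frame.index)
        for column, value in zip(COLUMNS, values):
            mask &= frame[column] == value
        ratings = frame.loc[mask, "rating"]
        if not ratings.empty:
            return ratings.value_counts().index[0], dict(zip(COLUMNS, values))
        values.pop()  # relax the last (most specific) filter and retry
    return None, {}


rating, used = top_rating_with_fallback(df, "ma", "M", "young", "teacher")
print(rating, used)
```

Returning the filters that were used makes it visible to the caller that 'teacher' was dropped, which the plain Series result does not show.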
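Answer 2 (score: 0)
pandas is definitely the go-to library for handling detailed tabular data. For those seeking a non-pandas option, you can build your own mapping and reduction functions. I use these terms for concepts analogous to the pandas groupby / aggregation concepts.
Given
Cleaned data in which runs of whitespace are replaced with a single delimiter, e.g. ",".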
%%file "test.txt"
status,gender,age_range,occ,rating
ma,M,young,student,PG
ma,F,adult,teacher,R
sin,M,young,student,PG
sin,M,adult,teacher,R
ma,M,young,student,PG
sin,F,adult,teacher,R
Code
import csv
import collections as ct
Step 1: Read the data
def read_file(fname):
    with open(fname, "r") as f:
        reader = csv.DictReader(f)
        # Each line is a dict keyed by the header names
        for line in reader:
            yield line

iterable = [line for line in read_file("test.txt")]
iterable
Output
[OrderedDict([('status', 'ma'),
('gender', 'M'),
('age_range', 'young'),
('occ', 'student'),
('rating', 'PG')]),
OrderedDict([('status', 'ma'),
('gender', 'F'),
('age_range', 'adult'),
...]
...
]
Step 2: Remap the data
def mapping(data, column):
    """Return a dict of regrouped data."""
    dd = ct.defaultdict(list)
    for d in data:
        key = d[column]
        # Keep every column except the one used as the grouping key
        value = {k: v for k, v in d.items() if k != column}
        dd[key].append(value)
    return dict(dd)

mapping(iterable, "gender")
Output
{'M': [
{'age_range': 'young', 'occ': 'student', 'rating': 'PG', ...},
...]
'F': [
{'status': 'ma', 'age_range': 'adult', ...},
...]
}
Step 3: Reduce the data
def reduction(data):
    """Return a reduced mapping of Counters."""
    final = {}
    for key, val in data.items():
        agg = ct.defaultdict(ct.Counter)
        for d in val:
            # Tally every value seen in each remaining column
            for k, v in d.items():
                agg[k][v] += 1
        final[key] = dict(agg)
    return final

reduction(mapping(iterable, "gender"))
Output
{'F': {
'age_range': Counter({'adult': 2}),
'occ': Counter({'teacher': 2}),
'rating': Counter({'R': 2}),
'status': Counter({'ma': 1, 'sin': 1})},
'M': {
'age_range': Counter({'adult': 1, 'young': 3}),
'occ': Counter({'student': 3, 'teacher': 1}),
'rating': Counter({'PG': 3, 'R': 1}),
'status': Counter({'ma': 2, 'sin': 2})}
}
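The tallying inside reduction relies on pairing defaultdict with Counter. As a small standalone sketch (the three sample rows below are made up for illustration and are not the full data set), the first time a column name is seen the defaultdict creates an empty Counter for it, and each later occurrence just increments a count:

```python
from collections import Counter, defaultdict

# Illustrative rows only, a subset of the columns used above.
rows = [
    {"age_range": "young", "rating": "PG"},
    {"age_range": "adult", "rating": "R"},
    {"age_range": "young", "rating": "PG"},
]

tally = defaultdict(Counter)
for row in rows:
    for column, value in row.items():
        # A missing column key gets a fresh Counter automatically;
        # a missing value inside a Counter starts from 0.
        tally[column][value] += 1

print(dict(tally))
```

Neither level needs an explicit "if key not in ..." check, which is what keeps the reduction function short.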
Demo
With these tools, you can build a data pipeline and query the data, feeding the result of one function into the next:
# Find the top age range among males
pipeline = reduction(mapping(iterable, "gender"))
pipeline["M"]["age_range"].most_common(1)
# [('young', 3)]
# Find the top ratings among teachers
pipeline = reduction(mapping(iterable, "occ"))
pipeline["teacher"]["rating"].most_common()
# [('R', 3)]
# Find the number of married people
pipeline = reduction(mapping(iterable, "gender"))
sum(v["status"]["ma"] for k, v in pipeline.items())
# 3
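Overall, you can tailor the output by how you define the reduction function.
Note that although this generic process applies powerfully to many columns of data, its code is more verbose than the former example. pandas encapsulates these concepts succinctly. Although its learning curve may be steeper at first, it can greatly speed up data analysis.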
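For comparison, here is a rough sketch of how the three demo queries might look in pandas. It assumes the same six rows loaded into a DataFrame with the column names from test.txt; the variable names are illustrative and this is one possible phrasing, not the only one.

```python
import pandas as pd

# Same rows as test.txt, with its column names.
df = pd.DataFrame(
    [
        ("ma", "M", "young", "student", "PG"),
        ("ma", "F", "adult", "teacher", "R"),
        ("sin", "M", "young", "student", "PG"),
        ("sin", "M", "adult", "teacher", "R"),
        ("ma", "M", "young", "student", "PG"),
        ("sin", "F", "adult", "teacher", "R"),
    ],
    columns=["status", "gender", "age_range", "occ", "rating"],
)

# Top age range among males
top_age = df.loc[df["gender"] == "M", "age_range"].value_counts().head(1)

# Rating counts among teachers
teacher_ratings = df.loc[df["occ"] == "teacher", "rating"].value_counts()

# Number of married people
married = int((df["status"] == "ma").sum())

# Mapping + reduction in one step: age_range counts per gender
by_gender = df.groupby("gender")["age_range"].value_counts()

print(top_age, teacher_ratings, married, by_gender, sep="\n\n")
```

The last line is the closest analogue of reduction(mapping(iterable, "gender")) restricted to one column: groupby plays the role of the mapping and value_counts the role of the Counter.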
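Details
Step 1 (read data): csv.DictReader parses each line of the cleaned file and keeps the header names as the dictionary keys. This structure makes it easy to access columns by name.
Step 2 (remap data): the data is regrouped into a dictionary whose keys are the values found in the selected column, e.g. "M", "F".
Step 3 (reduce data): defaultdict and Counter combine into an excellent reducing data structure, where a new entry in the defaultdict initializes a Counter and repeated entries simply tally observations.
Application
Pipelines are optional. Here we build a function that handles serial requests: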
def serial_reduction(iterable, val_queries):
    """Return a `Counter` that is reduced after serial queries."""
    q1, *qs = val_queries
    # Map each value to its column name (built from the first row only)
    val_to_key = {v: k for k, v in iterable[0].items()}
    values_list = mapping(iterable, val_to_key[q1])[q1]
    counter = ct.Counter()
    # Process queries for dicts in each row and build a counter
    for q in qs:
        try:
            for row in values_list[:]:
                if val_to_key[q] not in row:
                    continue
                else:
                    reduced_vals = {v for v in row.values() if v not in qs}
            for val in reduced_vals:
                counter[val] += 1
        except KeyError:
            raise ValueError("'{}' not found. Try a new query.".format(q))
    return counter

c = serial_reduction(iterable, "ma M young".split())
c.most_common()
# [('student', 2), ('PG', 2)]
serial_reduction(iterable, "ma M young teacher".split())
# ValueError: 'teacher' not found. Try a new query.
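The question asks for a fallback rather than an error when the full combination is missing. As a hedged, non-pandas sketch of that behavior, the block below filters the rows by each query value in column order and drops trailing values until at least one row matches. The names RAW, COLUMNS and top_rating are assumptions made for this illustration, and the data is read from an in-memory string instead of test.txt so the block runs on its own.

```python
import csv
import io
from collections import Counter

# Same content as test.txt, inlined so the sketch is self-contained.
RAW = """status,gender,age_range,occ,rating
ma,M,young,student,PG
ma,F,adult,teacher,R
sin,M,young,student,PG
sin,M,adult,teacher,R
ma,M,young,student,PG
sin,F,adult,teacher,R
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
COLUMNS = ["status", "gender", "age_range", "occ"]  # query values follow this order


def top_rating(rows, *values):
    """Return (most common rating, values actually used).

    Hypothetical helper: trailing values are dropped until some row matches.
    """
    values = list(values)
    while values:
        matches = [
            row for row in rows
            if all(row[col] == val for col, val in zip(COLUMNS, values))
        ]
        if matches:
            rating, _count = Counter(r["rating"] for r in matches).most_common(1)[0]
            return rating, values
        values.pop()  # no match: relax the most specific value and retry
    return None, []


print(top_rating(rows, "ma", "M", "young", "teacher"))
```

Unlike a lookup table built from the first row, this version compares each value against its own column in every row, so it does not depend on which record happens to come first.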