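I am trying to group a DataFrame by the values in its "category" column. However, each row of another column, "prob", contains a list of tuples. When I try to group by "category", the "prob" column disappears.
My current df: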
Expected output:
category  other:  prob:
one       val     [(hi, hello), (jimbob, joe)]
one       val2    [(this, not), (is, work), (now, any)]
two       val2    [(bob, jones), (work, here)]
three     val3    [(milk, coffee), (tea, bread)]
two       val3    [(money, here), (job, money)]
What is the best way to do this? Apologies if I have worded the question badly; please let me know. Thanks!
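Answer 0 (score: 4)
You can use GroupBy.agg with join to aggregate the string column, and a flattening function to combine the lists of tuples. Three solutions are included; use sum only when the data is small and performance does not matter: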
import functools
import operator
from itertools import chain
f = lambda x: [z for y in x for z in y]
#faster alternative
#f = lambda x: list(chain.from_iterable(x))
#faster alternative2
#f = lambda x: functools.reduce(operator.iadd, x, [])
#slow alternative
#f = lambda x: x.sum()
df = df.groupby('category', as_index=False).agg({'other':', '.join, 'prob':f})
print (df)
   category       other                                               prob
0       one   val, val2  [(hi, hello), (jimbob, joe), (this, not), (is,...
1     three        val3                     [(milk, coffee), (tea, bread)]
2       two  val2, val3  [(bob, jones), (work, here), (money, here), (j...
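To run this end to end, here is a minimal sketch that rebuilds the frame by hand from the table in the question and applies the chain.from_iterable variant. It assumes the tuple elements are plain strings and that the columns are named other and prob without the trailing colons; the helper name flatten is only illustrative.

```python
from itertools import chain

import pandas as pd

# Sample frame typed in from the question's table (assumption: the tuple
# elements are plain strings).
df = pd.DataFrame({
    "category": ["one", "one", "two", "three", "two"],
    "other": ["val", "val2", "val2", "val3", "val3"],
    "prob": [
        [("hi", "hello"), ("jimbob", "joe")],
        [("this", "not"), ("is", "work"), ("now", "any")],
        [("bob", "jones"), ("work", "here")],
        [("milk", "coffee"), ("tea", "bread")],
        [("money", "here"), ("job", "money")],
    ],
})


def flatten(lists):
    """Concatenate the per-row lists of one group into a single list."""
    return list(chain.from_iterable(lists))


# Join the strings of each group and flatten its lists of tuples.
out = df.groupby("category", as_index=False).agg(
    {"other": ", ".join, "prob": flatten}
)
print(out)
```

Because the aggregation is spelled out per column in the dict, the prob column is kept in the result rather than dropped.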
Performance:
Test code:
import functools
import operator
import string
from itertools import chain

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)

# Each kernel joins the 'other' strings and flattens the 'prob' lists per category.
def iadd(df1):
    f = lambda x: functools.reduce(operator.iadd, x, [])
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def listcomp(df1):
    f = lambda x: [z for y in x for z in y]
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def from_iterable(df1):
    f = lambda x: list(chain.from_iterable(x))
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_series(df1):
    f = lambda x: x.sum()
    d = {'other':', '.join, 'prob':f}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_cat(df1):
    d = {'other':lambda x: x.str.cat(sep=', '), 'prob':'sum'}
    return df1.groupby('category', as_index=False).agg(d)

def sum_groupby_join(df1):
    d = {'other': ', '.join, 'prob': 'sum'}
    return df1.groupby('category', as_index=False).agg(d)

def make_df(n):
    # Integer division keeps the upper bound of randint an int.
    a = np.random.randint(0, n // 10, n)
    b = np.random.choice(list('abcdef'), len(a))
    c = [tuple(np.random.choice(list(string.ascii_letters), 2)) for _ in a]
    df = pd.DataFrame({"category":a, "other":b, "prob":c})
    # Collect the tuples into one list per (category, other) pair.
    df1 = df.groupby(['category','other'])['prob'].apply(list).reset_index()
    return df1

perfplot.show(
    setup=make_df,
    kernels=[iadd, listcomp, from_iterable, sum_series,sum_groupby_cat,sum_groupby_join],
    n_range=[10**k for k in range(1, 8)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
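If perfplot is not available, a rough comparison of the flattening functions can be made with the standard library's timeit. The sketch below is a stand-in under stated assumptions, not the benchmark above: it times the functions on plain Python lists with no pandas involved, and the helper names, input size, and repeat count are made up for illustration.

```python
import functools
import operator
import timeit
from itertools import chain


def flatten_listcomp(lists):
    return [z for y in lists for z in y]


def flatten_chain(lists):
    return list(chain.from_iterable(lists))


def flatten_iadd(lists):
    # operator.iadd extends the fresh accumulator list in place.
    return functools.reduce(operator.iadd, lists, [])


def flatten_sum(lists):
    # sum builds a new list at every step, so the total work grows quadratically.
    return sum(lists, [])


# Hypothetical input: 2,000 small lists of tuples.
data = [[(i, i + 1), (i + 2, i + 3)] for i in range(2000)]

results = {}
for fn in (flatten_listcomp, flatten_chain, flatten_iadd, flatten_sum):
    results[fn.__name__] = timeit.timeit(lambda: fn(data), number=20)

# Print the timings from fastest to slowest.
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f}s")
```

Absolute timings depend on the machine and the input shape, so treat the ordering as indicative only.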
Answer 1 (score: 2)
You can GroupBy the category column and aggregate with the following functions:
df.groupby('category', as_index=False).agg({'other':lambda x: x.str.cat(sep=', '),
'prob':'sum'})
For the first few rows this gives:
   category      other                                               prob
0       one  val, val2  [(hi, hello), (jimbob, joe), (this, not), (is,...
1       two       val2                       [(bob, jones), (work, here)]
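The string aggregation here uses Series.str.cat where the other answer uses ', '.join. As a small sketch on hypothetical one-group values, the two give the same string when nothing is missing, but they differ on missing values: str.cat omits them by default, while str.join raises a TypeError.

```python
import pandas as pd

# Hypothetical 'other' values of a single group.
s = pd.Series(["val", "val2"])
joined_with_cat = s.str.cat(sep=", ")
joined_with_join = ", ".join(s)
print(joined_with_cat == joined_with_join)

# With a missing value, str.cat skips it, while str.join cannot handle it.
with_missing = pd.Series(["val", None, "val2"])
cat_result = with_missing.str.cat(sep=", ")
try:
    ", ".join(with_missing)
    join_failed = False
except TypeError:
    join_failed = True
print(cat_result, join_failed)
```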
Answer 2 (score: 0)