My DataFrame df looks like this:
attribute_ids attributes_names
['adr4r','5ty6gh'] ['abc','xyz']
['fg67y','ty67g','ght43','adr4r'] ['pqr','abc','xyz','abc']
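I want to count how many times each unique attribute_id occurs and display the result in descending order of count. The result should also show the corresponding attribute name. Note that attribute names are not unique, but attribute_ids are. For example, adr4r and ty67g both have the same name, "abc". The output should look like this: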
attribute_ids attribute_names count
adr4r abc 2
ty67g abc 1
5ty6gh xyz 1
ght43 xyz 1
fg67y pqr 1
So far I can only count the attribute_ids (without showing the corresponding attribute names):
count=df.attribute_ids.apply(pd.Series).stack().value_counts()
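Answer 0 (score: 3)
Option 1: pir1
Use np.concatenate to flatten the lists and np.unique to identify the unique values. return_counts=True gives the count of each unique id, and the indices from return_index=True are used to slice names.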
ids = np.concatenate(df.attribute_ids)
names = np.concatenate(df.attribute_names)
u, idx, cts = np.unique(ids, return_index=True, return_counts=True)
pd.DataFrame(dict(
    attribute_ids=u,
    attribute_names=names[idx],
    count=cts
))
attribute_ids attribute_names count
0 5ty6gh xyz 1
1 adr4r abc 2
2 fg67y pqr 1
3 ght43 xyz 1
4 ty67g abc 1
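To make the role of the two flags concrete, here is a minimal sketch that runs np.unique on the flattened arrays by hand. The arrays are typed out directly from the question's data rather than built from df, and the variable first_names is only an illustrative name.

```python
import numpy as np

# Flattened ids and names, in the same order as in the question's data.
ids = np.array(['adr4r', '5ty6gh', 'fg67y', 'ty67g', 'ght43', 'adr4r'])
names = np.array(['abc', 'xyz', 'pqr', 'abc', 'xyz', 'abc'])

# u:   the sorted unique ids
# idx: position of the first occurrence of each unique id in `ids`
# cts: number of times each unique id occurs
u, idx, cts = np.unique(ids, return_index=True, return_counts=True)

# ids and names are aligned, so names[idx] is the name paired with
# the first occurrence of each unique id.
first_names = names[idx]

print(list(zip(u, first_names, cts)))
```

Because np.unique returns its values sorted, the rows come out ordered by id rather than by count, which is why the output above starts with 5ty6gh.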
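Option 2: pir2
Flatten both columns with concat, group by the attribute_ids column, then use agg to take the first name and the count for each group.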
from cytoolz import concat
d1 = pd.DataFrame(dict(
    attribute_ids=list(concat(df.attribute_ids.values.tolist())),
    attribute_names=list(concat(df.attribute_names.values.tolist()))
))
d1.groupby('attribute_ids').attribute_names.agg(['first', 'count']) \
    .reset_index().rename(columns=dict(first='attribute_names'))
attribute_ids attribute_names count
0 5ty6gh xyz 1
1 adr4r abc 2
2 fg67y pqr 1
3 ght43 xyz 1
4 ty67g abc 1
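If cytoolz is not installed, the flattening step can probably be done with the standard library instead. The sketch below swaps concat for itertools.chain.from_iterable, which is my substitution and not part of the original answer; the names flat and out are illustrative.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attribute_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))

# chain.from_iterable flattens one level of nesting (stand-in for concat).
flat = pd.DataFrame(dict(
    attribute_ids=list(chain.from_iterable(df.attribute_ids)),
    attribute_names=list(chain.from_iterable(df.attribute_names))
))

# Per id: keep the first name seen and count the rows in the group.
out = (
    flat.groupby('attribute_ids').attribute_names
        .agg(['first', 'count'])
        .reset_index()
        .rename(columns=dict(first='attribute_names'))
)
print(out)
```

Taking 'first' is safe here only because the question states that each id maps to exactly one name.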
Option 3: pir3
Use pd.factorize on the (id, name) tuples. Use concat to flatten the arrays.
from cytoolz import concat
i = concat(df.attribute_ids.values.tolist())
n = concat(df.attribute_names.values.tolist())
f, u = pd.Series(list(zip(i, n))).factorize()
pd.DataFrame(
    u.tolist(),
    columns=['attribute_ids', 'attribute_names']
).assign(count=np.bincount(f))
attribute_ids attribute_names count
0 adr4r abc 2
1 5ty6gh xyz 1
2 fg67y pqr 1
3 ty67g abc 1
4 ght43 xyz 1
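To see why factorize and np.bincount fit together, here is a small sketch on the six (id, name) pairs written out by hand. The names pairs, codes, uniques and counts are illustrative and do not appear in the answer.

```python
import numpy as np
import pandas as pd

pairs = pd.Series([
    ('adr4r', 'abc'), ('5ty6gh', 'xyz'), ('fg67y', 'pqr'),
    ('ty67g', 'abc'), ('ght43', 'xyz'), ('adr4r', 'abc'),
])

# codes:   one integer per element, numbered in order of first appearance
# uniques: the distinct tuples, in that same order
codes, uniques = pairs.factorize()

# bincount tallies how often each code occurs, so counts[k] is the
# number of occurrences of uniques[k].
counts = np.bincount(codes)

print(codes.tolist(), counts.tolist())
```

Since the codes follow order of first appearance, the result keeps the original order of the data instead of sorting by id, which matches the output shown above.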
Timing
results.div(results.min(1), 0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(1)))
pir1 pir2 pir3 galt Best
N
1 1.00 4.37 1.75 4.07 pir1
3 1.00 3.84 1.54 4.58 pir1
10 1.48 2.46 1.00 2.38 pir3
30 2.42 3.09 1.00 2.86 pir3
100 5.56 2.42 1.00 2.58 pir3
300 14.86 2.52 1.00 2.42 pir3
1000 24.63 1.37 1.00 1.43 pir3
3000 38.14 1.47 1.00 1.35 pir3
10000 41.85 1.36 1.00 1.14 pir3
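The table is relative: each row is divided by its fastest timing, so 1.00 marks the winner for that N and the other entries are multiples of it. A minimal sketch of that normalization, using made-up timings (not the measurements from this answer) and hypothetical column names a and b:

```python
import pandas as pd

# Hypothetical raw timings in seconds, for illustration only.
raw = pd.DataFrame(
    {'a': [2.0, 9.0], 'b': [4.0, 3.0]},
    index=pd.Index([1, 10], name='N')
)

# Divide every row by that row's minimum, so the fastest entry becomes 1.0.
relative = raw.div(raw.min(axis=1), axis=0).round(2)

# Record which column was fastest in each row.
relative = relative.assign(Best=relative.idxmin(axis=1))
print(relative)
```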
fig, (a1, a2) = plt.subplots(1, 2, figsize=(6, 6))
results.plot(loglog=True, lw=3, ax=a1)
results.div(results.min(1), 0).round(2).plot.barh(logx=True, ax=a2)
fig.tight_layout()
Code
import numpy as np
import pandas as pd
from cytoolz import concat
from timeit import timeit

def galt(df):
    cols = df.columns.tolist()
    # Flatten every list column, then count each (id, name) combination.
    return pd.DataFrame({
        c: [v for L in df[c] for v in L] for c in cols
    }).groupby(cols).size().reset_index(name='count')

def pir1(df):
    ids = np.concatenate(df.attribute_ids)
    names = np.concatenate(df.attribute_names)
    u, idx, cts = np.unique(ids, return_index=True, return_counts=True)
    return pd.DataFrame(dict(
        attribute_ids=u,
        attribute_names=names[idx],
        count=cts
    ))

def pir2(df):
    d1 = pd.DataFrame(dict(
        attribute_ids=list(concat(df.attribute_ids.values.tolist())),
        attribute_names=list(concat(df.attribute_names.values.tolist()))
    ))
    return d1.groupby('attribute_ids').attribute_names.agg(['first', 'count']) \
        .reset_index().rename(columns=dict(first='attribute_names'))

def pir3(df):
    i = concat(df.attribute_ids.values.tolist())
    n = concat(df.attribute_names.values.tolist())
    f, u = pd.Series(list(zip(i, n))).factorize()
    return pd.DataFrame(
        u.tolist(),
        columns=['attribute_ids', 'attribute_names']
    ).assign(count=np.bincount(f))
results = pd.DataFrame(
    index=pd.Index([1, 3, 10, 30, 100, 300, 1000, 3000, 10000], name='N'),
    columns='pir1 pir2 pir3 galt'.split(),
    dtype=float
)
for i in results.index:
    # Repeat the two-row sample frame i times to scale the input.
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell.
        results.at[i, j] = timeit(stmt, setp, number=10)
Setup
df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attribute_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))
Answer 1 (score: 3)
Here is one approach, by flattening the DataFrame:
In [1292]: cols = df.columns.tolist()
In [1293]: (pd.DataFrame({c: [v for L in df[c] for v in L] for c in cols})
.groupby(cols).size())
Out[1293]:
attribute_ids attributes_names
5ty6gh xyz 1
adr4r abc 2
fg67y pqr 1
ght43 xyz 1
ty67g abc 1
dtype: int64
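groupby sorts by the group keys, so the result above is ordered by id, not by count. To get the descending order the question asks for, one option is to add a sort_values step, roughly as sketched here. The names flat and counts are illustrative, and the column name attributes_names follows the frame used in this answer.

```python
import pandas as pd

df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attributes_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))
cols = df.columns.tolist()

# Flatten each list column into one long column.
flat = pd.DataFrame({c: [v for L in df[c] for v in L] for c in cols})

# Count each (id, name) pair, then order by count, largest first.
counts = (
    flat.groupby(cols).size()
        .reset_index(name='count')
        .sort_values('count', ascending=False)
        .reset_index(drop=True)
)
print(counts)
```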
Details
In [1294]: df
Out[1294]:
attribute_ids attributes_names
0 [adr4r, 5ty6gh] [abc, xyz]
1 [fg67y, ty67g, ght43, adr4r] [pqr, abc, xyz, abc]
In [1295]: cols = df.columns.tolist(); cols
Out[1295]: ['attribute_ids', 'attributes_names']
In [1296]: {c: [v for L in df[c] for v in L] for c in cols}
Out[1296]:
{'attribute_ids': ['adr4r', '5ty6gh', 'fg67y', 'ty67g', 'ght43', 'adr4r'],
'attributes_names': ['abc', 'xyz', 'pqr', 'abc', 'xyz', 'abc']}
In [1297]: pd.DataFrame({c: [v for L in df[c] for v in L] for c in cols})
Out[1297]:
attribute_ids attributes_names
0 adr4r abc
1 5ty6gh xyz
2 fg67y pqr
3 ty67g abc
4 ght43 xyz
5 adr4r abc
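On newer pandas versions the same flattened frame can likely be produced without the comprehension. This is a sketch of a different technique from the one used in this answer: it assumes pandas 1.3 or later, where DataFrame.explode accepts a list of columns, and it requires the lists in each row to have matching lengths.

```python
import pandas as pd

df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attributes_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))

# Explode both list columns together so ids and names stay paired
# (multi-column explode needs pandas >= 1.3).
flat = df.explode(['attribute_ids', 'attributes_names'], ignore_index=True)
print(flat)
```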
Or, if you want a Series of tuples:
In [1311]: pd.Series(list(
zip(*[[v for L in df[c] for v in L] for c in cols]))).value_counts()
Out[1311]:
(adr4r, abc) 2
(fg67y, pqr) 1
(ght43, xyz) 1
(ty67g, abc) 1
(5ty6gh, xyz) 1
dtype: int64