Question

My dataframe df looks like this:

attribute_ids                     attributes_names
['adr4r','5ty6gh']                ['abc','xyz'] 
['fg67y','ty67g','ght43','adr4r'] ['pqr','abc','xyz','abc']

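I want to count how many times each unique attribute_id occurs and display the result in descending order of count. I also want the result to show the corresponding attribute name for each id. Note that attribute names are not unique, but attribute_ids are. For example, adr4r and ty67g both have the same name "abc". The output should look like this:

attribute_ids       attribute_names    count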
adr4r               abc                2
ty67g               abc                1
5ty6gh              xyz                1
ght43               xyz                1
fg67y               pqr                1

So far I can only get the counts for attribute_ids (without showing the corresponding attribute_names):

count=df.attribute_ids.apply(pd.Series).stack().value_counts()

Answer 1

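Option 1
pir1

Flatten the columns with np.concatenate.
Use np.unique to identify the unique values, along with:
- the counts, via the parameter return_counts=True
- the index of each value's first occurrence, via return_index=True, so that I can use it to slice names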

ids = np.concatenate(df.attribute_ids)
names = np.concatenate(df.attribute_names)

u, idx, cts = np.unique(ids, return_index=True, return_counts=True)

pd.DataFrame(dict(
    attribute_ids=u,
    attribute_names=names[idx],
    count=cts
))

  attribute_ids attribute_names  count
0        5ty6gh             xyz      1
1         adr4r             abc      2
2         fg67y             pqr      1
3         ght43             xyz      1
4         ty67g             abc      1
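Note that np.unique returns the ids in sorted order, not in descending order of count as the question asked. A minimal sketch of one way to get that ordering, assuming the same df as in the Setup section (the variable names out and out_sorted are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attribute_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))

# Flatten both list columns into 1-D arrays of equal length.
ids = np.concatenate(df.attribute_ids.tolist())
names = np.concatenate(df.attribute_names.tolist())

# idx holds the position of the first occurrence of each unique id,
# so names[idx] picks the name that belongs to that id.
u, idx, cts = np.unique(ids, return_index=True, return_counts=True)

out = pd.DataFrame(dict(attribute_ids=u, attribute_names=names[idx], count=cts))

# Sort by count, largest first; mergesort is stable, so ties keep their id order.
out_sorted = (out.sort_values('count', ascending=False, kind='mergesort')
                 .reset_index(drop=True))
print(out_sorted)
```

The same sort_values call can be appended to the result of any of the options here.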

Option 2
pir2

Similar to option 1, we flatten the columns into a new dataframe,
then group by the attribute_ids column and use agg

from cytoolz import concat

d1 = pd.DataFrame(dict(
    attribute_ids=list(concat(df.attribute_ids.values.tolist())),
    attribute_names=list(concat(df.attribute_names.values.tolist()))
))

d1.groupby('attribute_ids').attribute_names.agg(['first', 'count']) \
    .reset_index().rename(columns=dict(first='attribute_names'))

  attribute_ids attribute_names  count
0        5ty6gh             xyz      1
1         adr4r             abc      2
2         fg67y             pqr      1
3         ght43             xyz      1
4         ty67g             abc      1
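If cytoolz is not available, the same idea can probably be written with only the standard library and pandas. This is a sketch, not the method timed in this answer: it swaps in itertools.chain.from_iterable for concat and uses named aggregation (which, as far as I know, needs pandas 0.25 or later) in place of the rename step.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attribute_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))

# Flatten each list column; chain.from_iterable plays the role of cytoolz.concat.
d1 = pd.DataFrame(dict(
    attribute_ids=list(chain.from_iterable(df.attribute_ids)),
    attribute_names=list(chain.from_iterable(df.attribute_names)),
))

# Named aggregation: output column = (input column, aggregation function).
named = d1.groupby('attribute_ids').agg(
    attribute_names=('attribute_names', 'first'),
    count=('attribute_names', 'count'),
).reset_index()
print(named)
```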

Option 3
pir3
Use pd.factorize on tuples of (id, name). Use concat to flatten the arrays.

from cytoolz import concat

i = concat(df.attribute_ids.values.tolist())
n = concat(df.attribute_names.values.tolist())
f, u = pd.Series(list(zip(i, n))).factorize()
pd.DataFrame(
    u.tolist(),
    columns=['attribute_ids', 'attribute_names']
).assign(count=np.bincount(f))

  attribute_ids attribute_names  count
0         adr4r             abc      2
1        5ty6gh             xyz      1
2         fg67y             pqr      1
3         ty67g             abc      1
4         ght43             xyz      1
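To make the factorize step less opaque, here is a small sketch of what the two intermediate values look like on the six flattened (id, name) pairs from the sample data. The names pairs, codes, uniques and counts are illustrative only.

```python
import numpy as np
import pandas as pd

# The flattened (id, name) pairs from the sample data, in row order.
pairs = pd.Series([
    ('adr4r', 'abc'), ('5ty6gh', 'xyz'),
    ('fg67y', 'pqr'), ('ty67g', 'abc'), ('ght43', 'xyz'), ('adr4r', 'abc'),
])

# factorize assigns each distinct pair an integer code in order of first appearance.
codes, uniques = pairs.factorize()

# bincount counts how often each code occurs, i.e. how often each pair occurs.
counts = np.bincount(codes)

print(codes)             # one integer code per pair
print(uniques.tolist())  # the distinct pairs, in order of first appearance
print(counts)            # one count per distinct pair
```

Because the codes follow the order of first appearance, the result keeps the original row order rather than sorting by id, which is why the output above starts with adr4r.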

Timing

results.div(results.min(1), 0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(1)))

        pir1  pir2  pir3  galt  Best
N                                   
1       1.00  4.37  1.75  4.07  pir1
3       1.00  3.84  1.54  4.58  pir1
10      1.48  2.46  1.00  2.38  pir3
30      2.42  3.09  1.00  2.86  pir3
100     5.56  2.42  1.00  2.58  pir3
300    14.86  2.52  1.00  2.42  pir3
1000   24.63  1.37  1.00  1.43  pir3
3000   38.14  1.47  1.00  1.35  pir3
10000  41.85  1.36  1.00  1.14  pir3

import matplotlib.pyplot as plt

fig, (a1, a2) = plt.subplots(1, 2, figsize=(6, 6))
results.plot(loglog=True, lw=3, ax=a1)
results.div(results.min(1), 0).round(2).plot.barh(logx=True, ax=a2)
fig.tight_layout()

Code

import numpy as np
import pandas as pd
from timeit import timeit
from cytoolz import concat

def galt(df):
    cols = df.columns.tolist()
    return pd.DataFrame({
        c: [v for L in df[c] for v in L] for c in cols
    }).groupby(cols).size().reset_index(name='count')

def pir1(df):
    ids = np.concatenate(df.attribute_ids)
    names = np.concatenate(df.attribute_names)

    u, idx, cts = np.unique(ids, return_index=True, return_counts=True)

    return pd.DataFrame(dict(
        attribute_ids=u,
        attribute_names=names[idx],
        count=cts
    ))

def pir2(df):
    d1 = pd.DataFrame(dict(
        attribute_ids=list(concat(df.attribute_ids.values.tolist())),
        attribute_names=list(concat(df.attribute_names.values.tolist()))
    ))

    return d1.groupby('attribute_ids').attribute_names.agg(['first', 'count']) \
        .reset_index().rename(columns=dict(first='attribute_names'))

def pir3(df):
    i = concat(df.attribute_ids.values.tolist())
    n = concat(df.attribute_names.values.tolist())
    f, u = pd.Series(list(zip(i, n))).factorize()
    return pd.DataFrame(
        u.tolist(),
        columns=['attribute_ids', 'attribute_names']
    ).assign(count=np.bincount(f))


results = pd.DataFrame(
    index=pd.Index([1, 3, 10, 30, 100, 300, 1000, 3000, 10000], name='N'),
    columns='pir1 pir2 pir3 galt'.split(),
    dtype=float
)

for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)
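The loop above builds statement strings and imports from __main__, which only works when run as a script or in an interactive session. A hedged alternative sketch passes a callable to timeit instead. Here flatten_and_count is just a stand-in with the same body as galt, and the candidates dict and the shortened list of sizes are assumptions made for illustration.

```python
from timeit import timeit

import pandas as pd

df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attribute_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))

def flatten_and_count(d):
    # Stand-in candidate (same body as galt): flatten, then count (id, name) pairs.
    cols = d.columns.tolist()
    flat = pd.DataFrame({c: [v for L in d[c] for v in L] for c in cols})
    return flat.groupby(cols).size().reset_index(name='count')

# Map a label to each function under test; add the other candidates here.
candidates = {'galt': flatten_and_count}

results = pd.DataFrame(
    index=pd.Index([1, 3, 10], name='N'),
    columns=list(candidates),
    dtype=float
)

for n in results.index:
    d = pd.concat([df] * n, ignore_index=True)
    for name, func in candidates.items():
        # timeit accepts a zero-argument callable, so no setup string is needed.
        results.at[n, name] = timeit(lambda: func(d), number=10)

print(results)
```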

Setup

df = pd.DataFrame(dict(
    attribute_ids=[['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    attribute_names=[['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']]
))

Answer 2

Here is one way, by flattening the dataframe:

In [1292]: cols = df.columns.tolist()

In [1293]: (pd.DataFrame({c: [v for L in df[c] for v in L] for c in cols})
              .groupby(cols).size())
Out[1293]:
attribute_ids  attributes_names
5ty6gh         xyz                 1
adr4r          abc                 2
fg67y          pqr                 1
ght43          xyz                 1
ty67g          abc                 1
dtype: int64

Details

In [1294]: df
Out[1294]:
                  attribute_ids      attributes_names
0               [adr4r, 5ty6gh]            [abc, xyz]
1  [fg67y, ty67g, ght43, adr4r]  [pqr, abc, xyz, abc]

In [1295]: df.columns.tolist()
Out[1295]: ['attribute_ids', 'attributes_names']

In [1296]: {c: [v for L in df[c] for v in L] for c in cols}
Out[1296]:
{'attribute_ids': ['adr4r', '5ty6gh', 'fg67y', 'ty67g', 'ght43', 'adr4r'],
 'attributes_names': ['abc', 'xyz', 'pqr', 'abc', 'xyz', 'abc']}

In [1297]: pd.DataFrame({c: [v for L in df[c] for v in L] for c in cols})
Out[1297]:
  attribute_ids attributes_names
0         adr4r              abc
1        5ty6gh              xyz
2         fg67y              pqr
3         ty67g              abc
4         ght43              xyz
5         adr4r              abc

Or, if you want a Series of tuples:

In [1311]: pd.Series(list(
                zip(*[[v for L in df[c] for v in L] for c in cols]))).value_counts()
Out[1311]:
(adr4r, abc)     2
(fg67y, pqr)     1
(ght43, xyz)     1
(ty67g, abc)     1
(5ty6gh, xyz)    1
dtype: int64
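On newer pandas versions the flattening step can probably be done with Series.explode instead of a nested comprehension. This is a sketch of that alternative, not part of the approach above. It assumes pandas 0.25 or later and that the two list columns have matching lengths in every row.

```python
import pandas as pd

df = pd.DataFrame({
    'attribute_ids': [['adr4r', '5ty6gh'], ['fg67y', 'ty67g', 'ght43', 'adr4r']],
    'attributes_names': [['abc', 'xyz'], ['pqr', 'abc', 'xyz', 'abc']],
})
cols = df.columns.tolist()

# Explode each list column into one value per row. to_numpy() drops the
# repeated index labels so the columns line up purely by position.
flat = pd.DataFrame({c: df[c].explode().to_numpy() for c in cols})

# Count each (id, name) pair and list the most frequent first.
counts = (flat.groupby(cols).size()
              .sort_values(ascending=False)
              .reset_index(name='count'))
print(counts)
```

The sort_values step gives the descending order requested in the question. The order among ids with equal counts is not guaranteed.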
