How to get weighted word counts from a pivot table

Time: 2019-01-29 07:04:46

Tags: pandas dataframe

This is my pivot table:

No  Keyword              Count
1   Sell Laptop Online   10
2   Buy Computer Online  8
3   Laptop and Case      5

This is what I want:

No   Word      Count
1    Online    18
2    Laptop    15
3    Sell      10
4    Buy        8
5    Computer   8
6    and        5
7    Case       5 

What I did was:

df['Keyword'].str.split(expand=True).stack().value_counts()

But the result is:

No   Word      Count
1    Online    2
2    Laptop    2
3    Sell      1
4    Buy       1
5    Computer  1
6    and       1
7    Case      1 

I want the word counts weighted by the Count column of the pivot table.

2 Answers:

Answer 0 (score: 2)

Use:

df1 = (df.set_index('Count')['Keyword']
         .str.split(expand=True)
         .stack()
         .reset_index(name='Word')
         .groupby('Word')['Count']
         .sum()
         .sort_values(ascending=False)
         .reset_index())

Explanation:

  1. set_index on Count so the count values are kept as the index and not lost
  2. str.split(expand=True) to split Keyword into a DataFrame of words
  3. reshape with stack
  4. convert the MultiIndex to columns with reset_index (see the sketch after this list)
  5. aggregate with sum per Word
  6. sort the Series with Series.sort_values
  7. finally reset_index
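
For reference, a minimal sketch (assuming the sample DataFrame from the question) of the intermediate result after steps 1-4, just before the groupby:

import pandas as pd

# sample pivot table from the question
df = pd.DataFrame({'Keyword': ['Sell Laptop Online',
                               'Buy Computer Online',
                               'Laptop and Case'],
                   'Count': [10, 8, 5]})

# steps 1-4: set_index, split, stack, reset_index
step4 = (df.set_index('Count')['Keyword']
           .str.split(expand=True)
           .stack()
           .reset_index(name='Word'))
print (step4)
#    Count  level_1      Word
# 0     10        0      Sell
# 1     10        1    Laptop
# 2     10        2    Online
# 3      8        0       Buy
# 4      8        1  Computer
# 5      8        2    Online
# 6      5        0    Laptop
# 7      5        1       and
# 8      5        2      Case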

Another solution - faster if the DataFrame is large:

from itertools import chain

s = df['Keyword'].str.split()

df = pd.DataFrame({
    'Word' : list(chain.from_iterable(s.values.tolist())), 
    'Count' : df['Count'].repeat(s.str.len())
})

print (df)
       Word  Count
0      Sell     10
0    Laptop     10
0    Online     10
1       Buy      8
1  Computer      8
1    Online      8
2    Laptop      5
2       and      5
2      Case      5

df1 = df.groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
print (df1)
       Word  Count
0    Online     18
1    Laptop     15
2      Sell     10
3  Computer      8
4       Buy      8
5       and      5
6      Case      5

Explanation:

  1. First repeat each Count value once per word obtained by splitting Keyword, building a new long-format DataFrame (see the sketch after this list)
  2. Aggregate with sum, sort the Series and finally reset_index
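
To illustrate step 1, a small sketch (again starting from the sample pivot table in the question, since df is reassigned above) of how s.str.len() drives Count.repeat, so each count is duplicated once per word:

import pandas as pd

# sample pivot table from the question
df = pd.DataFrame({'Keyword': ['Sell Laptop Online',
                               'Buy Computer Online',
                               'Laptop and Case'],
                   'Count': [10, 8, 5]})

s = df['Keyword'].str.split()
print (s.str.len().tolist())
# [3, 3, 3]   <- number of words per row
print (df['Count'].repeat(s.str.len()).tolist())
# [10, 10, 10, 8, 8, 8, 5, 5, 5]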

Solution using defaultdict:

from collections import defaultdict

out = defaultdict(int)
for k, c in zip(df['Keyword'], df['Count']):
    for x in k.split():
        out[x] += c

print (out)
defaultdict(<class 'int'>, {'Sell': 10,
                            'Laptop': 15, 
                            'Online': 18, 
                            'Buy': 8, 
                            'Computer': 8,
                            'and': 5,
                            'Case': 5})

#sorting by values and DataFrame constructor
#https://stackoverflow.com/a/613218
df = pd.DataFrame(sorted(out.items(), key=lambda kv: kv[1], reverse=True),
                  columns=['Word','Count'])
print (df)

       Word  Count
0    Online     18
1    Laptop     15
2      Sell     10
3       Buy      8
4  Computer      8
5       and      5
6      Case      5

Performance - it depends on the real data, but the defaultdict solution seems to be the fastest:

import string
from itertools import chain
from collections import defaultdict

import numpy as np
import pandas as pd

np.random.seed(456)


a = np.random.randint(0, 20, 10000)
b = [' '.join(np.random.choice(list(string.ascii_letters), 
                               np.random.randint(3, 5))) for _ in range(len(a))]

df = pd.DataFrame({"Keyword":b, "Count":a})
#print (df)

In [49]: %%timeit
    ...: f1 = (df.set_index('Count')['Keyword']
    ...:          .str.split(expand=True)
    ...:          .stack()
    ...:          .reset_index(name='Word')
    ...:          .groupby('Word')['Count']
    ...:          .sum()
    ...:          .sort_values(ascending=False)
    ...:          .reset_index())
    ...: 
35.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [52]: %%timeit
    ...: from itertools import chain
    ...: 
    ...: s = df['Keyword'].str.split()
    ...: 
    ...: pd.DataFrame({
    ...:     'Word' : list(chain.from_iterable(s.values.tolist())), 
    ...:     'Count' : df['Count'].repeat(s.str.len())
    ...: }).groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
    ...: 
14.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %%timeit
    ...: from collections import defaultdict
    ...: 
    ...: out = defaultdict(int)
    ...: for k, c in zip(df['Keyword'], df['Count']):
    ...:     for x in k.split():
    ...:         out[x] += c
    ...: pd.DataFrame(sorted(out.items(), key=lambda kv: kv[1], reverse=True), columns=['Word','Count'])
    ...: 
8.82 ms ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#Dark's solution (answer 1 below)
In [54]: %%timeit
    ...: df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],0).sum(0).to_frame('Count')
    ...: 
307 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 1 (score: 1)

Here is a simple way using one-hot encoding:

df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],axis=0).sum(0).to_frame('Count')

          Count
Buy           8
Case          5
Computer      8
Laptop       15
Online       18
Sell         10
and           5

If speed matters, try scikit-learn's MultiLabelBinarizer, i.e.:

from sklearn.preprocessing import MultiLabelBinarizer
vec = MultiLabelBinarizer()

oh = (vec.fit_transform(df['Keyword'].str.split()) * df['Count'].values[:,None]).sum(0)
pd.DataFrame({'Count': oh ,'Word':vec.classes_})

Explanation:

get_dummies produces a one-hot encoded DataFrame:

    Buy  Case  Computer  Laptop  Online  Sell  and
 0    0     0         0       1       1     1    0
 1    1     0         1       0       1     0    0
 2    0     1         0       1       0     0    1

Multiplying each row by its Count (axis=0) gives:

   Buy  Case  Computer  Laptop  Online  Sell  and
0    0     0         0      10      10    10    0
1    8     0         8       0       8     0    0
2    0     5         0       5       0     0    5

Sum the columns and convert to a DataFrame:

Buy          8
Case         5
Computer     8
Laptop      15
Online      18
Sell        10
and          5
dtype: int64
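
If you also want the Word/Count layout from the question, sorted by count, a small follow-up (a sketch, not part of the original answer) could be:

import pandas as pd

# pivot table from the question
df = pd.DataFrame({'Keyword': ['Sell Laptop Online', 'Buy Computer Online', 'Laptop and Case'],
                   'Count': [10, 8, 5]})

res = (df['Keyword'].str.get_dummies(sep=' ')
         .mul(df['Count'], axis=0)
         .sum(0)
         .sort_values(ascending=False)
         .rename_axis('Word')
         .reset_index(name='Count'))
print (res)
#      Word  Count
# 0  Online     18
# 1  Laptop     15
# 2    Sell     10
# ...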