这是我的数据透视表
No Keyword Count
1 Sell Laptop Online 10
2 Buy Computer Online 8
3 Laptop and Case 5
这就是我想要的
No Word Count
1 Online 18
2 Laptop 15
3 Sell 10
4 Buy 8
5 Computer 8
6 and 5
7 Case 5
我所做的是
df['Word'].apply(lambda x: x.str.split(expand=True).stack()).stack().value_counts()
但是结果是
No Word Count
1 Online 2
2 Laptop 2
3 Sell 1
4 Buy 1
5 Computer 1
6 and 1
7 Case 1
我想对数据透视表中的字数进行加权
答案 0 :(得分:2)
使用:
df1 = (df.set_index('Count')['Keyword']
.str.split(expand=True)
.stack()
.reset_index(name='Word')
.groupby('Word')['Count']
.sum()
.sort_values(ascending=False)
.reset_index())
说明:
Count
设置为set_index
的索引,以防止丢失此信息split
创建DataFrame
stack
重塑MultiIndex
转换为reset_index
的列sum
Series.sort_values
排序Series
reset_index
另一种解决方案-如果DataFrame
较大,则更快:
from itertools import chain
s = df['Keyword'].str.split()
df = pd.DataFrame({
'Word' : list(chain.from_iterable(s.values.tolist())),
'Count' : df['Count'].repeat(s.str.len())
})
print (df)
Word Count
0 Sell 10
0 Laptop 10
0 Online 10
1 Buy 8
1 Computer 8
1 Online 8
2 Laptop 5
2 and 5
2 Case 5
df1 = df.groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
print (df1)
Word Count
0 Online 18
1 Laptop 15
2 Sell 10
3 Computer 8
4 Buy 8
5 and 5
6 Case 5
说明:
Count
通过的分裂值的值Keyword
,以新的数据帧sum
,对Series和最后reset_index
进行排序使用defaultdict
的解决方案:
from collections import defaultdict
out = defaultdict(int)
for k, c in zip(df['Keyword'], df['Count']):
for x in k.split():
out[x] += c
print (out)
defaultdict(<class 'int'>, {'Sell': 10,
'Laptop': 15,
'Online': 18,
'Buy': 8,
'Computer': 8,
'and': 5,
'Case': 5})
#sorting by values and DataFrame constructor
#https://stackoverflow.com/a/613218
df = pd.DataFrame(sorted(out.items(), key=lambda kv: kv[1], reverse=True),
columns=['Word','Count'])
print (df)
Word Count
0 Online 18
1 Laptop 15
2 Sell 10
3 Buy 8
4 Computer 8
5 and 5
6 Case 5
<强>性能强> - 取决于实际的数据,但似乎溶液defaultdict
是最快的:
np.random.seed(456)
import string
from itertools import chain
from collections import defaultdict
a = np.random.randint(0, 20, 10000)
b = [' '.join(np.random.choice(list(string.ascii_letters),
np.random.randint(3, 5))) for _ in range(len(a))]
df = pd.DataFrame({"Keyword":b, "Count":a})
#print (df)
In [49]: %%timeit
...: f1 = (df.set_index('Count')['Keyword']
...: .str.split(expand=True)
...: .stack()
...: .reset_index(name='Word')
...: .groupby('Word')['Count']
...: .sum()
...: .sort_values(ascending=False)
...: .reset_index())
...:
35.5 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [52]: %%timeit
...: from itertools import chain
...:
...: s = df['Keyword'].str.split()
...:
...: pd.DataFrame({
...: 'Word' : list(chain.from_iterable(s.values.tolist())),
...: 'Count' : df['Count'].repeat(s.str.len())
...: }).groupby('Word')['Count'].sum().sort_values(ascending=False).reset_index()
...:
14.5 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [53]: %%timeit
...: from collections import defaultdict
...:
...: out = defaultdict(int)
...: for k, c in zip(df['Keyword'], df['Count']):
...: for x in k.split():
...: out[x] += c
...: pd.DataFrame(sorted(out.items(), key=lambda kv: kv[1], reverse=True), columns=['Word','Count'])
...:
8.82 ms ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Dark solution
In [54]: %%timeit
...: df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],0).sum(0).to_frame('Count')
...:
307 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
答案 1 :(得分:1)
这是一种简单的方法,只有一种热编码。
df['Keyword'].str.get_dummies(sep=' ').mul(df['Count'],axis=0).sum(0).to_frame('Count')
Count
Buy 8
Case 5
Computer 8
Laptop 15
Online 18
Sell 10
and 5
如果速度有所提高,请尝试使用scikit的多标签二值化器。即
from sklearn.preprocessing import MultiLabelBinarizer
vec = MultiLabelBinarizer()
oh = (vec.fit_transform(df['Keyword'].str.split()) * df['Count'].values[:,None]).sum(0)
pd.DataFrame({'Count': oh ,'Word':vec.classes_})
说明:
获取假人将导致热编码数据帧变热,
Buy Case Computer Laptop Online Sell and
0 0 0 0 1 1 1 0
1 1 0 1 0 1 0 0
2 0 1 0 1 0 0 1
乘以列中的计数
Buy Case Computer Laptop Online Sell and
0 0 0 0 10 10 10 0
1 8 0 8 0 8 0 0
2 0 5 0 5 0 0 5
求和并转换为数据框。
Buy 8
Case 5
Computer 8
Laptop 15
Online 18
Sell 10
and 5
dtype: int64