I want to count the top subjects in a column. Some fields contain commas or periods, and I want to split those fields into new rows.
import pandas as pd
from pandas import DataFrame, Series
sbj = DataFrame(["Africa, Business", "Oceania",
"Business.Biology.Pharmacology.Therapeutics",
"French Litterature, Philosophy, Arts", "Biology,Business", ""
])
sbj
I want to split any field that contains ',' or '.' into new rows:
sbj_top = sbj[0].apply(lambda x: pd.value_counts(x.split(",")) if not pd.isnull(x) else pd.value_counts('---'.split(","))).sum(axis = 0)
sbj_top
I get an error (AttributeError) when I try to split again on '.':
sbj_top = sbj_top.apply(lambda x: pd.value_counts(x.split(".")) if not pd.isnull(x) else pd.value_counts('---'.split(","))).sum(axis = 0)
sbj_top
The output I want:
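import matplotlib.pyplot as plt

# Series.sort() is gone from current pandas; sort_values() returns a sorted copy
sbj_top = sbj_top.sort_values(ascending=False)
plt.title("Distribution of the top 10 subjects")
plt.ylabel("Frequency")
sbj_top.head(10).plot(kind='bar', color="#348ABD")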
Answer 0 (score: 1)
You can use Counter together with chain from itertools. Note that I first replace the periods with commas before splitting.
from collections import Counter
import itertools
from string import whitespace
trimmed_list = [i.replace('.', ',').split(',') for i in sbj[0].tolist() if i != ""]
item_list = [item.strip(whitespace) for item in itertools.chain(*trimmed_list)]
item_count = Counter(item_list)
>>> item_count.most_common()
[('Business', 3),
('Biology', 2),
('Oceania', 1),
('Pharmacology', 1),
('Philosophy', 1),
('Africa', 1),
('French Litterature', 1),
('Therapeutics', 1),
('Arts', 1)]
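For comparison, the same count can probably be done without leaving pandas. This is only a sketch of an alternative, not the approach used above: it assumes a pandas version that provides Series.explode (0.25 or later) and reuses the sample data from the question.

```python
import pandas as pd

# Same sample data as in the question (one column of delimited subjects).
sbj = pd.DataFrame(["Africa, Business", "Oceania",
                    "Business.Biology.Pharmacology.Therapeutics",
                    "French Litterature, Philosophy, Arts", "Biology,Business", ""])

subjects = (
    sbj[0]
    .str.split(r"[.,]")  # multi-character pattern, so it is treated as a regex
    .explode()           # one row per subject
    .str.strip()         # drop whitespace around each subject
)
subjects = subjects[subjects != ""]  # discard the empty field
sbj_top = subjects.value_counts()    # counts, sorted from most to least frequent
print(sbj_top.head(10))
```

Splitting on the character class [.,] handles both delimiters in one pass, so the second split that raised the AttributeError is not needed.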
If you need the output as a DataFrame:
df = pd.DataFrame(item_list, columns=['subject'])
>>> df
subject
0 Africa
1 Business
2 Oceania
3 Business
4 Biology
5 Pharmacology
6 Therapeutics
7 French Litterature
8 Philosophy
9 Arts
10 Biology
11 Business
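To get from these counts to the top-10 bar chart asked for in the question, something like the following might work. It is a sketch under assumptions: item_list is copied from the output above, plot_top_subjects is a hypothetical helper name, and the chart itself requires matplotlib to be installed.

```python
from collections import Counter
import pandas as pd

# Copied from the DataFrame output above so the example runs on its own.
item_list = ["Africa", "Business", "Oceania", "Business", "Biology",
             "Pharmacology", "Therapeutics", "French Litterature",
             "Philosophy", "Arts", "Biology", "Business"]

# most_common(10) returns (subject, count) pairs, highest count first.
top10 = Counter(item_list).most_common(10)
sbj_top = pd.Series(dict(top10), name="Frequency")

def plot_top_subjects(counts):
    """Hypothetical helper: draw the bar chart described in the question."""
    import matplotlib.pyplot as plt  # imported here so the counting works without it
    ax = counts.plot(kind="bar", color="#348ABD")
    ax.set_title("Distribution of the top 10 subjects")
    ax.set_ylabel("Frequency")
    plt.show()

# plot_top_subjects(sbj_top)  # uncomment to display the chart
```

Because most_common already sorts by count, no separate sort step is needed before plotting.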