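I have a DataFrame with the data points shown below. Here count is the number of times an articleTag was read, and articleTag holds the tags of an articleId; for example, articleId 590020 has four tags, A, B, C and D, stored as a single comma-separated string.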
   articleId articleTag  count
0     590020    A,B,C,D      2
1     466322      A,B,E      3
2     466322          B      2
3     466322          A      1
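I need to find the tag distribution, i.e. for each tag, the number of rows it appears in (Present) and the total number of times it was read (Read). For the example DataFrame above, the result should look like this: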
Tag  Present  Read
A          3     6
B          3     7
C          1     2
D          1     2
E          1     3
Any help would be appreciated.
Answer 0 (score: 1)
You can do it like this:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 6}, {"articleId": 590021, "articleTag": "A,B,E", "count": 3}])
In [3]: df[df.articleTag.str.contains("A")]['count'].sum()
Out[3]: 9
In [4]: len(df[df.articleTag.str.contains("A")])
Out[4]: 2
The first is your "Read" value and the second is your "Present" value.
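One caveat: str.contains does a substring (by default regex) match, so a tag "A" would also match a hypothetical tag such as "AB". The following is a minimal sketch of the same two computations that compares whole tags instead. The helper name tag_stats is made up for illustration, and the data is the question's example rather than the frame used above.

```python
import pandas as pd

# The question's example data
df = pd.DataFrame({
    "articleId": [590020, 466322, 466322, 466322],
    "articleTag": ["A,B,C,D", "A,B,E", "B", "A"],
    "count": [2, 3, 2, 1],
})


def tag_stats(frame, tag):
    """Return (present, read) for one tag, matching whole tags only."""
    # Split each tag string on commas and strip stray whitespace
    has_tag = frame["articleTag"].apply(
        lambda s: tag in [t.strip() for t in s.split(",")]
    )
    present = int(has_tag.sum())                    # rows containing the tag
    read = int(frame.loc[has_tag, "count"].sum())   # total reads of those rows
    return present, read


print(tag_stats(df, "A"))  # (3, 6)
```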
To find all the distinct tags, I would probably do something like this:
In [5]: tag_df = df.articleTag.str.split(',', expand=True)
In [6]: for column in tag_df.columns:
   ...:     print(tag_df[column].unique())
   ...:
['A']
['B']
['C' 'E']
['D ' None]
Instead of printing them, you can add them to a set and so collect all the tags you need to look up.
In [7]: unique_tags = set()
In [8]: for column in tag_df.columns:
   ...:     unique_tags |= set(tag_df[column].unique())
   ...:
In [9]: unique_tags
Out[9]: {'B', 'A', 'C', 'E', None, 'D '}
Of course, you will have to remove the None value.
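A possible way to do that cleanup while building the set, sketched under the assumption that trailing whitespace (as in "D ") should be ignored as well: drop the missing values from each column and strip every tag before adding it.

```python
import pandas as pd

# Same two-row frame as in the session above
df = pd.DataFrame([
    {"articleId": 590020, "articleTag": "A,B,C,D ", "count": 6},
    {"articleId": 590021, "articleTag": "A,B,E", "count": 3},
])

tag_df = df.articleTag.str.split(",", expand=True)

unique_tags = set()
for column in tag_df.columns:
    # dropna() skips the padding that expand=True adds for shorter rows;
    # str.strip() turns 'D ' into 'D'
    unique_tags |= set(tag_df[column].dropna().str.strip())

print(sorted(unique_tags))  # ['A', 'B', 'C', 'D', 'E']
```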
Answer 1 (score: 1)
import pandas as pd

df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 2},
                   {"articleId": 590021, "articleTag": "A,B,E", "count": 3},
                   {"articleId": 466322, "articleTag": "B", "count": 2},
                   {"articleId": 466322, "articleTag": "A", "count": 1}])

# Collect every tag that appears in any row, stripping stray whitespace
articles = []
for val in df['articleTag'].values:
    articles.extend(tag.strip() for tag in val.split(','))
unique_articles = sorted(set(articles))

final_dict = {}
final_dict['article'] = unique_articles
final_count = []
final_read = []
for article in unique_articles:
    # Read counts of the rows whose tag string contains this tag.
    # Note: `in` is a substring test, which is fine for single-letter tags.
    reads = [count
             for tags, count in zip(df['articleTag'].values, df['count'].values)
             if article in tags]
    final_read.append(sum(reads))
    final_count.append(len(reads))
final_dict['Present'] = final_count
final_dict['Read'] = final_read
pd.DataFrame(final_dict)
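For comparison, the same table can probably be produced without explicit loops. The sketch below is not part of either answer; it assumes pandas 0.25 or later (for explode and named aggregation) and uses the question's example data. Each tag string is split into a list, exploded to one row per tag, and then grouped.

```python
import pandas as pd

# The question's example data
df = pd.DataFrame({
    "articleId": [590020, 466322, 466322, 466322],
    "articleTag": ["A,B,C,D", "A,B,E", "B", "A"],
    "count": [2, 3, 2, 1],
})

# One row per (article row, tag) pair
exploded = df.assign(Tag=df["articleTag"].str.split(",")).explode("Tag")
exploded["Tag"] = exploded["Tag"].str.strip()  # tolerate stray whitespace

# Present = number of rows carrying the tag, Read = sum of their counts
result = (
    exploded.groupby("Tag")["count"]
    .agg(Present="size", Read="sum")
    .reset_index()
)
print(result)
```

Because explode yields one row per tag occurrence, a tag repeated inside a single string would be counted twice; the example data has no such repeats.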