Count how many times part of a string occurs in a DataFrame

Time: 2017-10-13 04:09:09

Tags: python-2.7 dataframe

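I have a DataFrame with the following data points. Here, count is the number of times the articleTag was read. articleTag holds the tags of the articleId; for example, articleId 590020 has four tags, A, B, C and D, stored as a single string.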

      articleId     articleTag       count  
  0     590020      A,B,C,D             2   
  1     466322      A,B,E               3   
  2     466322      B                   2   
  3     466322      A                   1   

I need to find the tag distribution, that is, how many times each tag appears across the articles and how many times it was read.

For the example DataFrame above, the expected result is:

Tag       Present       Read
A           3            6
B           3            7
C           1            2
D           1            2
E           1            3

Any help is appreciated.

2 answers:

Answer 0 (score: 1)

You can do it like this:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 6}, {"articleId": 590021, "articleTag": "A,B,E", "count": 3}])

In [3]: df[df.articleTag.str.contains("A")]['count'].sum()
Out[3]: 9

In [4]: len(df[df.articleTag.str.contains("A")])
Out[4]: 2

The first is your "Read" value, and the second is your "Present" value.
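One caveat: `str.contains("A")` is a substring match, so with multi-character tags it would also count rows that only contain a longer tag such as "AB". Below is a minimal sketch of an exact-match variant, assuming tags are comma-separated and may carry stray whitespace. The third row with the tag "AB" and the helper name `has_tag` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "articleId": [590020, 590021, 590022],
    # "AB" is a hypothetical multi-character tag added to show the difference.
    "articleTag": ["A,B,C,D ", "A,B,E", "AB,C"],
    "count": [6, 3, 4],
})


def has_tag(tag_string, tag):
    """Return True if `tag` is one of the comma-separated tags in `tag_string`."""
    return tag in [t.strip() for t in tag_string.split(",")]


# Substring match: also selects the "AB,C" row.
substring_mask = df["articleTag"].str.contains("A")

# Exact match: only rows whose tag list contains "A" itself.
exact_mask = df["articleTag"].apply(lambda s: has_tag(s, "A"))

read = df.loc[exact_mask, "count"].sum()   # "Read" value for tag A
present = int(exact_mask.sum())            # "Present" value for tag A
```

With single-letter tags like those in the question, both masks select the same rows, so the simpler `str.contains` call is enough there.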

To find all the distinct tags, I would probably do this:

In [5]: tag_df = df.articleTag.str.split(',', expand=True)

In [6]: for column in tag_df.columns:
   ...:     print(tag_df[column].unique())
   ...:
['A']
['B']
['C' 'E']
['D ' None]

Instead of printing them, you can add them to a set and collect all the tags you need to look up.

In [7]: unique_tags = set()

In [8]: for column in tag_df.columns:
   ...:     unique_tags |= set(tag_df[column].unique())
   ...:

In [9]: unique_tags
Out[9]: {'B', 'A', 'C', 'E', None, 'D '}

Of course, you will have to remove the None value.
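A small sketch of that cleanup, starting from the set shown in Out[9]. It also strips whitespace, because the sample data contains 'D ' with a trailing space; `clean_tags` is just an illustrative name.

```python
# The set collected above, copied from Out[9].
unique_tags = {'B', 'A', 'C', 'E', None, 'D '}

# Drop the None placeholder and strip surrounding whitespace from each tag.
clean_tags = sorted(tag.strip() for tag in unique_tags if tag is not None)

print(clean_tags)  # ['A', 'B', 'C', 'D', 'E']
```

Each entry of `clean_tags` can then be used in place of the hard-coded "A" in the lookups from In [3] and In [4].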

Answer 1 (score: 1)

import pandas as pd

df = pd.DataFrame([{"articleId": 590020, "articleTag": "A,B,C,D ", "count": 2},
                   {"articleId": 590021, "articleTag": "A,B,E", "count": 3},
                   {"articleId": 466322, "articleTag": "B", "count": 2},
                   {"articleId": 466322, "articleTag": "A", "count": 1}])

# Split each row's tag string into a list of tags, stripping stray whitespace
# (the sample data contains "D " with a trailing space).
tag_lists = [[tag.strip() for tag in val.split(',')] for val in df['articleTag'].values]

# Collect the distinct tags across all rows.
unique_tags = sorted(set(tag for tags in tag_lists for tag in tags))

final_present = []
final_read = []
for tag in unique_tags:
    # Counts of the rows whose tag list contains this exact tag
    # (a substring test would also match longer tags that contain it).
    counts = [count for tags, count in zip(tag_lists, df['count'].values) if tag in tags]
    final_present.append(len(counts))
    final_read.append(sum(counts))

final_dict = {'Tag': unique_tags, 'Present': final_present, 'Read': final_read}
pd.DataFrame(final_dict, columns=['Tag', 'Present', 'Read'])
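
For comparison, here is a vectorized sketch of the same tag distribution. It is not taken from either answer. It assumes comma-separated tags, uses the sample data from the question, and names such as `tags` and `long_df` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "articleId": [590020, 466322, 466322, 466322],
    "articleTag": ["A,B,C,D", "A,B,E", "B", "A"],
    "count": [2, 3, 2, 1],
})

# One entry per (original row, tag): split on commas, then stack the columns.
tags = (
    df["articleTag"]
    .str.split(",", expand=True)
    .stack()
    .dropna()                          # remove padding for rows with fewer tags
    .str.strip()                       # tolerate stray whitespace around tags
    .reset_index(level=1, drop=True)   # keep only the original row index
    .rename("Tag")
)

# Attach each tag to the count of the row it came from (aligned on the index).
long_df = df[["count"]].join(tags)

# Per tag: number of rows containing it, and the total of their counts.
result = (
    long_df.groupby("Tag")["count"]
    .agg(["count", "sum"])
    .rename(columns={"count": "Present", "sum": "Read"})
    .reset_index()
)
print(result)
```

On the question's data this should reproduce the expected table: A and B are each present in three rows, with 6 and 7 reads respectively.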