How do I build a frequency count table of items in a pandas DataFrame?

Time: 2018-07-11 07:57:33

Tags: python pandas indexing word-frequency frequency-distribution

Suppose I have the following data in a CSV file, example.csv:

Word    Score
Dog     1
Bird    2
Cat     3
Dog     2
Dog     3
Dog     1
Bird    3
Cat     1
Bird    1
Cat     3

I want to count how often each word occurs with each score. The expected output is:

        1   2   3
Dog     2   1   1
Bird    1   1   1
Cat     1   0   2

My code for this is as follows:

import pandas as pd

x1 = pd.read_csv(r'path\to\example.csv')

def getUniqueWords(allWords) :
    uniqueWords = [] 
    for i in allWords:
        if not i in uniqueWords:
            uniqueWords.append(i)
    return uniqueWords

unique_words = getUniqueWords(x1['Word'])
unique_scores = getUniqueWords(x1['Score'])

scores_matrix = [[0 for x in range(len(unique_words))] for x in range(len(unique_scores)+1)]   
# The '+1' is because Python indexing starts from 0; so if a score of 0 is present in the data, the 0 index will be used for that. 

for i in range(len(unique_words)):
    temp = x1[x1['Word']==unique_words[i]]
    for j, word in temp.iterrows():
        scores_matrix[i][j] += 1  # Supposed to store the count for word i with score j

But this produces the following error:

IndexError                                Traceback (most recent call last)
<ipython-input-123-141ab9cd7847> in <module>()
     19     temp = x1[x1['Word']==unique_words[i]]
     20     for j, word in temp.iterrows():
---> 21         scores_matrix[i][j] += 1

IndexError: list index out of range

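Also, even if I could fix this error, scores_matrix would not carry any labels (Dog, Bird, Cat as the row index and 1, 2, 3 as the column index). I would like to be able to look up the count for each word at each score by label; for example, looking up Dog at score 1 should give 2.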

So, how can I resolve these two problems?

1 Answer:

Answer 0 (score: 3)

Use groupby with sort=False, together with either value_counts or size, followed by unstack:

df1 = df.groupby('Word', sort=False)['Score'].value_counts().unstack(fill_value=0)

df1 = df.groupby(['Word','Score'], sort=False).size().unstack(fill_value=0)

print (df1)
Score  1  2  3
Word          
Dog    2  1  1
Bird   1  1  1
Cat    1  0  2
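
For anyone who wants to try this without the CSV file, here is a minimal self-contained sketch of the size/unstack variant. It assumes the file is whitespace-separated (hence sep=r"\s+") and inlines the question's data through io.StringIO; the name counts is only illustrative. The columns are sorted explicitly, because with sort=False their order would otherwise follow first appearance in the data.

```python
import io

import pandas as pd

# Sample data copied from the question. The separator is an assumption:
# the file is treated as whitespace-delimited.
raw = """Word Score
Dog 1
Bird 2
Cat 3
Dog 2
Dog 3
Dog 1
Bird 3
Cat 1
Bird 1
Cat 3
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Count rows per (Word, Score) pair, then pivot Score into columns.
# Pairs that never occur (e.g. Cat with score 2) are filled with 0.
counts = df.groupby(["Word", "Score"], sort=False).size().unstack(fill_value=0)

# Make the column order explicit instead of relying on order of appearance.
counts = counts.sort_index(axis=1)
print(counts)
```

The value_counts variant should give the same table; size is used here only because it reads as "number of rows per group".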

If the order does not matter, use crosstab:

df1 = pd.crosstab(df['Word'], df['Score'])
print (df1)
Score  1  2  3
Word          
Bird   1  1  1
Cat    1  0  2
Dog    2  1  1
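
If you like crosstab but still want the rows in order of first appearance, one possible approach (not part of the original answer) is to reindex the result with the unique words. The sketch below builds the question's data inline; table and ordered are illustrative names.

```python
import pandas as pd

# The question's data, built inline so the example runs on its own.
df = pd.DataFrame({
    "Word": ["Dog", "Bird", "Cat", "Dog", "Dog", "Dog", "Bird", "Cat", "Bird", "Cat"],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3],
})

# crosstab returns the row labels sorted: Bird, Cat, Dog.
table = pd.crosstab(df["Word"], df["Score"])

# Restore the order in which each word first appears in the data.
ordered = table.reindex(df["Word"].unique())
print(ordered)
```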

Finally, select by label with DataFrame.loc:

print (df1.loc['Cat', 2])
0
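
As for the IndexError in the question: in `for j, word in temp.iterrows()`, j is the row's index label in x1 (0 through 9), not the score, so it quickly runs past the end of the inner list. The nested list is also built with one row per score and one column per word, while the loop indexes it the other way around. If you want to keep an explicit loop, a rough sketch of a labeled version could look like the following. It is an illustration rather than a recommendation over groupby or crosstab, and it reuses the question's names x1 and scores_matrix with the data inlined.

```python
import pandas as pd

# The question's data, built inline instead of read from example.csv.
x1 = pd.DataFrame({
    "Word": ["Dog", "Bird", "Cat", "Dog", "Dog", "Dog", "Bird", "Cat", "Bird", "Cat"],
    "Score": [1, 2, 3, 2, 3, 1, 3, 1, 1, 3],
})

# Zero-filled table labeled by the unique words (rows, in order of first
# appearance) and the sorted unique scores (columns).
scores_matrix = pd.DataFrame(
    0,
    index=pd.unique(x1["Word"]),
    columns=sorted(x1["Score"].unique()),
)

# Use each row's Score value as the column key, not the row label
# that iterrows() would return.
for word, score in zip(x1["Word"], x1["Score"]):
    scores_matrix.at[word, score] += 1

# Label-based lookups: .at for a single cell, .loc works as well.
print(scores_matrix.at["Dog", 1])
print(scores_matrix.loc["Cat", 2])
```

Because the table carries the words and scores as labels, both of the question's problems (the index error and the missing headers) go away in this version.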