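For loop to return unique values in a DataFrame

Time: 2019-04-22 13:24:44

Tags: python pandas

I am working through some beginner ML code. To count the number of unique samples in a column, the author uses the following code: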

def unique_vals(rows, col):
    """Find the unique values for a column in a dataset."""
    return set([row[col] for row in rows])

However, I am working with a DataFrame, and for me this code returns single letters: 'm', 'l', and so on. I tried changing it to:

set(row[row[col] for row in rows)

But that then returns:

KeyError: "None of [Index(['Apple', 'Banana', 'Grape'   dtype='object', length=2318)] are in the [columns]"

Thanks for your time!

2 Answers:

Answer 0: (score: 4)

In general, you don't need to do this kind of thing yourself, because pandas already does it for you.

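In this case, what you need is the unique method, which you can call directly on a pd.Series (the pandas abstraction that represents a column). It returns a numpy array containing the unique values in that Series.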
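As a minimal sketch of the point above (the DataFrame df and its fruit and color columns are made up for illustration, not taken from the question), calling unique on a column gives the distinct values directly. The same sketch also suggests why the original list comprehension returns single letters: iterating over a DataFrame yields its column labels, so indexing each label with an integer picks out one character.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Grape", "Apple", "Grape"],
    "color": ["red", "yellow", "purple", "green", "purple"],
})

# Series.unique returns the distinct values in order of first appearance.
unique_fruit = list(df["fruit"].unique())
print(unique_fruit)

# Iterating over a DataFrame yields column labels, not rows.
labels = [row for row in df]
print(labels)

# Indexing a label (a string) with an integer returns a single character,
# which is where the stray letters come from.
letters = set(row[0] for row in df)
print(letters)
```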

If you want the unique values of multiple columns, you can apply the same unique method to each of those columns in turn.
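One possible way to handle several columns is sketched below. This is not necessarily the snippet the answer originally showed; the DataFrame, the cols list, and the variable names are assumptions for illustration. It shows two readings of "unique values of multiple columns": the distinct values per column, and the distinct values pooled across all selected columns.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey"],
    "location": ["San_Diego", "New_York", "New_York", "San_Diego"],
})

cols = ["pets", "location"]

# Reading 1: distinct values of each column, keyed by column name.
unique_by_col = {col: list(df[col].unique()) for col in cols}
print(unique_by_col)

# Reading 2: distinct values pooled across the selected columns.
# to_numpy().ravel() flattens the selected cells into one 1-D array.
pooled = list(pd.unique(df[cols].to_numpy().ravel()))
print(pooled)
```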

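Answer 1: (score: 1)

If you are dealing with categorical columns, the following code is very useful.

It prints not only the unique values but also the count of each unique value.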

# Names of the categorical columns in your DataFrame (here called bd1)
categorical_columns = ['col1', 'col2', 'col3']  # ..., 'coln'

# Print frequency of categories
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(bd1[col].value_counts())

Example:

df

     pets     location     owner
0     cat    San_Diego     Champ
1     dog     New_York       Ron
2     cat     New_York     Brick
3  monkey    San_Diego     Champ
4     dog    San_Diego  Veronica
5     dog     New_York       Ron


categorical_columns = ['pets', 'owner', 'location']

# Print frequency of categories
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())

Output:

# Frequency of Categories for variable pets
# dog       3
# cat       2
# monkey    1
# Name: pets, dtype: int64

# Frequency of Categories for variable owner
# Champ       2
# Ron         2
# Brick       1
# Veronica    1
# Name: owner, dtype: int64

# Frequency of Categories for variable location
# New_York     3
# San_Diego    3
# Name: location, dtype: int64
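Since the question was about counting the unique samples in a column, here is a short sketch that rebuilds the example DataFrame above and relates value_counts to that count. The variable names are chosen for illustration; nunique is the pandas method that returns the number of distinct values in a Series.

```python
import pandas as pd

# Same data as the example DataFrame shown above.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
})

# value_counts maps each distinct value to the number of times it occurs.
counts = df["pets"].value_counts()
pet_counts = counts.to_dict()
print(pet_counts)

# nunique gives the number of distinct values directly.
n_unique = df["pets"].nunique()
print(n_unique)
```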