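Understanding DataFrames in Python

Time: 2016-12-14 21:24:33

Tags: python csv pandas dataframe

Can someone explain what this function is doing? I understand that it checks whether rows in the CSV are duplicated. However, I only want to check whether a specific column has duplicate values. How do I do that?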

@Validator
def hasDuplicates( fileInDf, fileType = File_Name_All, kwargs = def_kwargs ):
    ''' Return row indexes that are duplicates '''
    import pandas

    if fileInDf is None:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    if type( fileInDf ) is not pandas.DataFrame:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Type %s is not a valid DataFrame Type for rule : hasDuplicates' % type( fileInDf ))
    if fileInDf.empty:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )

    dups = fileInDf.duplicated()
    indexes = dups[ dups == True ].index.tolist()
    fixedDf = fileInDf.drop_duplicates()

    ret = Rule_Decision.FAILED if len( fixedDf ) != len( fileInDf ) else Rule_Decision.SUCCESS
    return ValidatorResponse( rule_decision = ret, rule_return_fixedDf = fixedDf, rule_return_val = indexes )

Update:

@Validator
def hasDuplicatesSingleColumn( val, fileInDf, fileType = File_Name_All, kwargs = def_kwargs ):
    ''' Return row indexes that are duplicates '''
    import pandas

    if fileInDf is None:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    if type( fileInDf ) is not pandas.DataFrame:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Type %s is not a valid DataFrame Type for rule : hasDuplicates' % type( fileInDf ))
    if fileInDf.empty:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )

    col_dups = fileInDf[['column']].duplicated()
    indexes = col_dups[ col_dups == True ].index.tolist()
    new_df = fileInDf[['column']].drop_duplicates()

    ret = Rule_Decision.FAILED if len( new_df ) != len( fileInDf ) else Rule_Decision.SUCCESS
    return ValidatorResponse( rule_decision = ret, rule_return_fixedDf = new_df, rule_return_val = indexes )

But how do I get the indexes? Is this the correct approach for the function above?

1 answer:

Answer 0 (score: 1)

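Do you just want to know whether a column contains duplicates? There are several ways to do this. Here is a simple one:

len(fileInDf.groupby('column').sum()) == len(fileInDf['column'])

This returns True only when the column has no duplicate values. (groupby drops NaN keys by default, so this assumes the column has no missing values.)

Another is to create a single-column data frame and use drop_duplicates on it:

new_df = fileInDf[['column']].drop_duplicates()

Now check whether the two have the same length:

len(new_df) == len(fileInDf)

Finally, you can use duplicated like this:

fileInDf[['column']].duplicated().any()

This statement returns True if there are duplicated values. (Writing True in fileInDf[['column']].duplicated() instead would test the index labels of the result, not its values, so .any() is the reliable form.)

Note that fileInDf[['column']] produces a data frame consisting of one column, unlike fileInDf['column'], which produces a Series object.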
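
As for getting the indexes, here is a minimal sketch of one way it could be done. Both duplicated and drop_duplicates accept a subset argument, so the check can be restricted to one column without slicing the frame down to that column first. The sample data, the 'other' column, and the variable names below are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data; 'column' mirrors the placeholder name used above.
df = pd.DataFrame(
    {"column": ["a", "b", "a", "c", "b"], "other": [1, 2, 3, 4, 5]},
    index=[10, 11, 12, 13, 14],
)

# Boolean Series: True for each row whose 'column' value was already seen.
# keep='first' is the default, so the first occurrence is not flagged.
mask = df.duplicated(subset=["column"])

# Index labels of the rows flagged as duplicates.
dup_indexes = mask[mask].index.tolist()  # [12, 14]

# keep=False flags every member of a duplicated group, first occurrence included.
mask_all = df.duplicated(subset=["column"], keep=False)
all_dup_indexes = mask_all[mask_all].index.tolist()  # [10, 11, 12, 14]

# Drop repeats in 'column' while keeping the other columns of the surviving rows.
fixed_df = df.drop_duplicates(subset=["column"])

# Plain bool answering "does the column contain any duplicate?"
has_dups = bool(mask.any())
```

In the question's validator, dup_indexes would presumably play the role of rule_return_val and fixed_df that of rule_return_fixedDf. Unlike fileInDf[['column']].drop_duplicates(), the subset form keeps all columns of the rows that remain, which is probably what a "fixed" frame should contain.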