Question

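Can someone explain what this function is doing? I understand that it checks whether rows in the CSV are duplicated. However, I only want to check whether a specific column has duplicate values. How do I do that?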

@Validator
def hasDuplicates( fileInDf, fileType = File_Name_All, kwargs = def_kwargs ):
    ''' Return row indexes that are duplicates '''
    import pandas

    if fileInDf is None:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    if type( fileInDf ) is not pandas.DataFrame:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Type %s is not a valid DataFrame Type for rule : hasDuplicates' % type( fileInDf ))
    if fileInDf.empty:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )

    dups = fileInDf.duplicated()
    indexes = dups[ dups == True ].index.tolist()
    fixedDf = fileInDf.drop_duplicates()

    ret = Rule_Decision.FAILED if len( fixedDf ) != len( fileInDf ) else Rule_Decision.SUCCESS
    return ValidatorResponse( rule_decision = ret, rule_return_fixedDf = fixedDf, rule_return_val = indexes )

Update:

@Validator
def hasDuplicatesSingleColumn( val, fileInDf, fileType = File_Name_All, kwargs = def_kwargs ):
    ''' Return row indexes that are duplicates '''
    import pandas

    if fileInDf is None:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    if type( fileInDf ) is not pandas.DataFrame:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Type %s is not a valid DataFrame Type for rule : hasDuplicates' % type( fileInDf ))
    if fileInDf.empty:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )

    col_dups = fileInDf[['column']].duplicated()
    indexes = col_dups[ col_dups == True ].index.tolist()
    new_df = fileInDf[['column']].drop_duplicates()

    ret = Rule_Decision.FAILED if len( new_df ) != len( fileInDf ) else Rule_Decision.SUCCESS
    return ValidatorResponse( rule_decision = ret, rule_return_fixedDf = new_df, rule_return_val = indexes )

But how do I get the indexes? And is this the right approach for the function above?

Answer 1

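You just want to know whether a column has duplicates? There are a few ways to do that. Here is a simple one:

len(fileInDf.groupby('column').sum()) == len(fileInDf['column'])

This returns True only if there are no duplicate values in the column.

Another is to create a single-column dataframe and use drop_duplicates there:

new_df = fileInDf[['column']].drop_duplicates()

Now see whether both have the same length:

len(new_df) == len(fileInDf)

Finally, you can use duplicated like this:

fileInDf[['column']].duplicated().any()

This statement returns True if there are duplicated values. (Avoid writing True in fileInDf[['column']].duplicated(): the in operator on a Series tests the index labels, not the values.)

Note that fileInDf[['column']] produces a dataframe consisting of one column, unlike fileInDf['column'], which produces a Series object.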
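To make the checks above concrete, and to cover the follow-up about getting the indexes, here is a minimal sketch on a small made-up frame. The sample data and the names df, col_dups and fixed_df are assumptions for illustration only; 'column' is the same placeholder column name used above, and passing subset= to duplicated and drop_duplicates is one possible way to keep the other columns, not necessarily what your validator framework expects.

```python
import pandas as pd

# Hypothetical sample data; "column" stands in for the real column name.
df = pd.DataFrame({"column": ["a", "b", "a", "c", "b"],
                   "other": [1, 2, 3, 4, 5]})

# Check 1: one group per distinct value, compared with the number of rows.
no_dups_groupby = len(df.groupby("column").sum()) == len(df["column"])

# Check 2: drop duplicates in the one-column frame and compare lengths.
new_df = df[["column"]].drop_duplicates()
no_dups_drop = len(new_df) == len(df)

# Check 3: boolean mask marking every repeat of an earlier value.
col_dups = df.duplicated(subset=["column"])
has_dups = bool(col_dups.any())

# Row indexes of the repeated values (the first occurrence is not flagged).
indexes = col_dups[col_dups].index.tolist()  # [2, 4]

# Full rows, de-duplicated on that one column only.
fixed_df = df.drop_duplicates(subset=["column"])

# Double brackets give a one-column DataFrame, single brackets a Series.
print(type(df[["column"]]).__name__, type(df["column"]).__name__)
print(has_dups, indexes)
```

Because the mask keeps the original index, indexes lines up with the rows of the input frame, and fixed_df still contains every column rather than just the one being checked.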
