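Can someone explain what this function does? As far as I can tell, it checks whether rows in a CSV are duplicated. However, I only want to check whether a specific column has duplicate values. How do I do that?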
@Validator
def hasDuplicates( fileInDf, fileType = File_Name_All, kwargs = def_kwargs ):
    ''' Return row indexes that are duplicates '''
    import pandas
    if fileInDf is None:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    if type( fileInDf ) is not pandas.DataFrame:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Type %s is not a valid DataFrame Type for rule : hasDuplicates' % type( fileInDf ))
    if fileInDf.empty:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    dups = fileInDf.duplicated()
    indexes = dups[ dups == True ].index.tolist()
    fixedDf = fileInDf.drop_duplicates()
    ret = Rule_Decision.FAILED if len( fixedDf ) != len( fileInDf ) else Rule_Decision.SUCCESS
    return ValidatorResponse( rule_decision = ret, rule_return_fixedDf = fixedDf, rule_return_val = indexes )
Update:
@Validator
def hasDuplicatesSingleColumn( val, fileInDf, fileType = File_Name_All, kwargs = def_kwargs ):
    ''' Return row indexes that are duplicates '''
    import pandas
    if fileInDf is None:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    if type( fileInDf ) is not pandas.DataFrame:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Type %s is not a valid DataFrame Type for rule : hasDuplicates' % type( fileInDf ))
    if fileInDf.empty:
        return ValidatorResponse( rule_decision = Rule_Decision.INVALID_INPUT, rule_return_message = 'Input File is not a valid file for rule : hasDuplicates' )
    col_dups = fileInDf[['column']].duplicated()
    indexes = col_dups[ col_dups == True ].index.tolist()
    new_df = fileInDf[['column']].drop_duplicates()
    ret = Rule_Decision.FAILED if len( new_df ) != len( fileInDf ) else Rule_Decision.SUCCESS
    return ValidatorResponse( rule_decision = ret, rule_return_fixedDf = new_df, rule_return_val = indexes )
But how do I get the indexes? Is this the right approach for the function above?
Answer 0 (score: 1)
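Do you just want to know whether a column has duplicates? There are several ways to do that. Here is a simple one:

    len(fileInDf.groupby('column').sum()) == len(fileInDf['column'])

This returns True only if there are no duplicate values in the column.

Another is to create a single-column dataframe and use drop_duplicates on it:

    new_df = fileInDf[['column']].drop_duplicates()

Now see whether the two have the same length:

    len(new_df) == len(fileInDf)

Finally, you can use duplicated like this:

    fileInDf[['column']].duplicated().any()

This statement returns True if there are duplicate values. Use .any() here rather than writing True in fileInDf[['column']].duplicated(), because the in operator on a Series tests the index labels, not the values.

Note that fileInDf[['column']] produces a dataframe consisting of one column, unlike fileInDf['column'], which produces a Series object.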
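To also get the row indexes the question asks about, here is a minimal sketch. It assumes you want every repeat after the first occurrence flagged, which is the default behaviour of `duplicated`. The helper name `duplicate_row_indexes` and the sample frame are made up for illustration and are not part of the original validator.

```python
import pandas as pd


def duplicate_row_indexes(df, column):
    """Return index labels of rows whose value in `column` repeats an earlier row."""
    # subset=[column] restricts the comparison to that one column;
    # keep='first' (the default) leaves the first occurrence unflagged.
    mask = df.duplicated(subset=[column])
    return df.index[mask.to_numpy()].tolist()


# Hypothetical sample data, only for demonstration.
df = pd.DataFrame({"id": [1, 2, 2, 3, 1], "name": ["a", "b", "c", "d", "e"]})

indexes = duplicate_row_indexes(df, "id")      # labels of the repeated rows
has_dups = df["id"].duplicated().any()         # True if any value repeats
fixed_df = df.drop_duplicates(subset=["id"])   # drops repeats, keeps all columns
```

Passing `subset` to `drop_duplicates` on the full frame keeps the other columns, whereas `fileInDf[['column']].drop_duplicates()` returns only that one column. Inside the validator you would probably pass the column name through the `val` argument instead of hard-coding `'column'`, but that depends on how the framework calls the rule.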