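I am evaluating the quality of database records. I have a CSV of binary (0/1) flags that tells me, for each record, whether a particular field is missing. My goal is to create a score by assigning different weights to the fields and then computing a final score from 1 to 100. Below is a visual of what I am working with. The first two rows are the weights, and the next three rows are what is in my df.
I was able to do the calculation by using numpy.where() to match the 1s and 0s to the weights, writing the results to separate columns, and then using the new columns to compute the score. I would like to know whether there is a more efficient way to do this. My code is below (I shortened the redundant parts to save some space):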
import numpy as np
import pandas as pd

df = pd.read_csv(path, names=names, usecols=usecols, header=None)
# Main Scoring Weights
contact_info_weight = 0.5
relations_weight = 0.25
demos_weight = 0.25
# Scoring sub-ratios - contact_info
org_address_rate = 30
org_city_rate = 30
.......other.....
# Scoring sub-ratios - relations
related_ind_rate = 40
related_org_rate = 20
related_pc_rate = 50
# Scoring sub-ratios - demos
org_type_rate = 40
market_segment_rate = 30
process_rate = 30
# Create new columns with score values
df['Org Address Line 1 Score'] = np.where(df['Org Address Line 1'] == 1, org_address_rate, 0)
df['Org City Score'] = np.where(df['Org City'] == 1, org_city_rate, 0)
.......other.....
# Calculate total sub scores
df['Contact Info Score'] = (
    df['Org Address Line 1 Score'] + df['Org City Score'] + df['Org State Score']
    + df['Org Postal Code Score'] + df['Org Country Score'] + df['Org Phone Score']
)
df['Demos Score'] = df['Org Type Score'] + df['Market Segment Score'] + df['Process Score']
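# NOTE: 'Market Segment Score' and 'Process Score' are already counted in 'Demos Score';
# this line may be a copy-paste slip and was presumably meant to sum the relations columns.
df['Relations Score'] = df['Ind ID Score'] + df['Market Segment Score'] + df['Process Score']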
# Calculate final score
df['Completeness Score'] = (
    df['Contact Info Score'] * contact_info_weight
    + df['Demos Score'] * demos_weight
    + df['Relations Score'] * relations_weight
)
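
One possible way to avoid a np.where() call and a helper column per field is to keep the rates in a dictionary and let pandas multiply and sum them in one step. The sketch below is only an illustration under assumptions: the sample DataFrame, the `sub_rates` and `category_weights` dictionaries, and the `add_scores` helper are hypothetical names that are not in my original code, only a subset of the fields is shown, and it assumes every flag column contains strictly 0 or 1 (so multiplying by the rate gives the same result as np.where(col == 1, rate, 0)).

```python
import pandas as pd

# Hypothetical 0/1 flags standing in for the real CSV (assumption: values are only 0 or 1).
df = pd.DataFrame({
    'Org Address Line 1': [1, 0, 1],
    'Org City': [1, 1, 0],
    'Org Type': [1, 0, 0],
    'Market Segment': [0, 1, 1],
})

# Per-field rates grouped by category, plus the category-level weights.
# Only a subset of the fields from the post is listed here.
sub_rates = {
    'Contact Info': {'Org Address Line 1': 30, 'Org City': 30},
    'Demos': {'Org Type': 40, 'Market Segment': 30},
}
category_weights = {'Contact Info': 0.5, 'Demos': 0.25}


def add_scores(df, sub_rates, category_weights):
    """Return a copy of df with one sub-score column per category and a final score."""
    out = df.copy()
    total = pd.Series(0.0, index=out.index)
    for category, rates in sub_rates.items():
        rates = pd.Series(rates)
        # Row-wise dot product: each 0/1 flag times its rate, summed per record.
        out[f'{category} Score'] = out[rates.index].dot(rates)
        total += out[f'{category} Score'] * category_weights[category]
    out['Completeness Score'] = total
    return out


scored = add_scores(df, sub_rates, category_weights)
print(scored[['Contact Info Score', 'Demos Score', 'Completeness Score']])
```

With this layout, adding a field means adding one dictionary entry rather than a new np.where() line and a new term in the sum. Whether it is actually faster on the real data would need to be measured.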