Question

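I am evaluating the quality of database records. I have a CSV of binary flags that tells me, for each record, whether a particular field is missing. My goal is to build a score by assigning different weights to the fields and then computing a final score from 1 to 100. Below is the visual I am working from: the first two rows are the weights, and the next three rows are what is in my df.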

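I was able to do the calculation by using numpy.where() to map the 1s and 0s to the weights, writing the results to separate columns, and then using those new columns to compute the score. I would like to know whether there is a more efficient way to do this. My code is below (I shortened the redundant parts to save some space).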

import numpy as np
import pandas as pd

df = pd.read_csv(path, names=names, usecols=usecols, header=None)

# Main Scoring Weights
contact_info_weight = 0.5
relations_weight = 0.25
demos_weight = 0.25

# Scoring Sub Ratios - contact_info
org_address_rate = 30
org_city_rate = 30
.......other.....

# Scoring Sub Ratios - relations
related_ind_rate = 40
related_org_rate = 20
related_pc_rate = 50

# Scoring Sub Ratios - demos
org_type_rate = 40
market_segment_rate = 30
process_rate = 30

# Create new columns with score values
df['Org Address Line 1 Score'] = np.where(df['Org Address Line 1'] == 1, org_address_rate, 0)
df['Org City Score'] = np.where(df['Org City'] == 1, org_city_rate, 0)
.......other.....

# Calculate total sub scores
df['Contact Info Score'] = df['Org Address Line 1 Score'] + df['Org City Score'] + df['Org State Score'] + \
                     df['Org Postal Code Score'] + df['Org Country Score'] + df['Org Phone Score']

df['Demos Score'] = df['Org Type Score'] + df['Market Segment Score'] + df['Process Score']

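# NOTE: 'Market Segment Score' and 'Process Score' are already counted in 'Demos Score' above;
# this looks like a copy-paste slip, and the relations score columns were probably intended here.
df['Relations Score'] = df['Ind ID Score'] + df['Market Segment Score'] + df['Process Score']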

# Calculate final score
df['Completeness Score'] = (df['Contact Info Score'] * contact_info_weight) + (df['Demos Score'] * demos_weight) + \
                           (df['Relations Score'] * relations_weight)
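
One possible way to avoid the intermediate score columns, offered as a minimal sketch rather than a definitive answer: because the flags are already 0/1, each flag can be multiplied by its weight directly, and the whole score collapses into a single dot product. The sample data, the `group_rates` and `group_weights` dictionaries, and the reduced field list below are all hypothetical stand-ins for the real CSV and the full set of rates. The sketch also assumes that 1 means the field is present, as the `np.where(... == 1, rate, 0)` calls above imply.

```python
import pandas as pd

# Hypothetical 0/1 flags standing in for the real CSV (assumes 1 = field present).
df = pd.DataFrame({
    "Org Address Line 1": [1, 0, 1],
    "Org City": [1, 1, 0],
    "Org Type": [1, 0, 0],
    "Market Segment": [0, 1, 1],
})

# Points per field inside each group (illustrative subset, not the full list).
group_rates = {
    "contact_info": {"Org Address Line 1": 30, "Org City": 30},
    "demos": {"Org Type": 40, "Market Segment": 30},
}

# Main scoring weights per group, as in the question.
group_weights = {"contact_info": 0.5, "demos": 0.25}

# Fold each group's weight into its fields' rates so that
# one weight per column is enough to produce the final score.
field_weights = pd.Series({
    col: rate * group_weights[group]
    for group, rates in group_rates.items()
    for col, rate in rates.items()
})

# Weighted sum across columns for every row; no per-field score columns needed.
completeness = df[field_weights.index].dot(field_weights)
df["Completeness Score"] = completeness
```

If the per-group sub-scores are still wanted, the same idea should work per group: select that group's columns and take the dot product with that group's rates only.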

Performing calculations on DataFrame columns without creating new columns

0 Answers: