I have a scenario where I test a series of characteristics for new subjects; every result is a categorical string value. Once testing is complete, I need to compare the new dataset against a master dataset of all subjects and look for similarities (matches) that hold above a given threshold (say 90%).
So I need to be able to do a column-wise (subject) comparison of each new subject in the new dataset against every column in the master dataset, as well as against the other columns in the new dataset, with the best performance possible, because the production dataset has about 500,000 columns (and growing) and 10,000 rows.
Here is some sample code:
import pandas as pd

master = pd.DataFrame({'Characteristic': ['C1', 'C2', 'C3'],
                       'S1': ['AA', 'BB', 'AB'],
                       'S2': ['AB', '-', 'BB'],
                       'S3': ['AA', 'AB', '--']})
new = pd.DataFrame({'Characteristic': ['C1', 'C2', 'C3'],
                    'S4': ['AA', 'BB', 'AA'],
                    'S5': ['AB', '-', 'BB']})
new_master = pd.merge(master, new, on='Characteristic', how='inner')
def doComparison(comparison_df, new_columns, master_columns):
    summary_dict = {}
    row_cnt = comparison_df.shape[0]
    for new_col in new_columns:
        # don't compare the Characteristic column
        if new_col != 'Characteristic':
            print('Evaluating subject ' + new_col + ' for matches')
            summary_dict[new_col] = []
            new_data = comparison_df.loc[:, new_col]
            for master_col in master_columns:
                # don't compare same subject or Characteristic column
                if new_col != master_col and master_col != 'Characteristic':
                    master_data = comparison_df.loc[:, master_col]
                    # count rows that agree, ignoring missing ('--') values
                    is_same = (new_data == master_data) & (new_data != '--') & (master_data != '--')
                    pct_same = sum(is_same) * 100 / row_cnt
                    if pct_same > 90:
                        print('  Found potential match ' + master_col + ' ' + str(pct_same) + ' pct')
                        summary_dict[new_col].append({'match': master_col, 'pct': pct_same})
    return summary_dict
result = doComparison(new_master, new.columns, master.columns)
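For the sample frames above, this finds S2 as a potential match for S5 and returns {'S4': [], 'S5': [{'match': 'S2', 'pct': 100.0}]}.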
This approach works, but I'd like to improve its efficiency and performance, and I don't know how.
Answer 0 (score: 1)
Another option:
import numpy as np
import pandas as pd
from sklearn.utils.extmath import cartesian
Take advantage of sklearn's cartesian function to build all pairs of column names:
col_combos = cartesian([new.columns[1:], master.columns[1:]])
print(col_combos)
[['S4' 'S1']
['S4' 'S2']
['S4' 'S3']
['S5' 'S1']
['S5' 'S2']
['S5' 'S3']]
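If you'd rather not depend on scikit-learn just for this step, the standard library's itertools.product yields the same pairs (a minimal equivalent sketch, not part of the original answer; the loop below indexes each pair with combo[0] and combo[1], which works for tuples as well):

from itertools import product

col_combos = list(product(new.columns[1:], master.columns[1:]))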
Create a dict with a key for each column in new except Characteristic. Note that this seems wasteful of space; maybe only keep the ones that match?
summary_dict = {c:[] for c in new.columns[1:]} #copied from @Parfait's answer
Pandas/NumPy makes it easy to compare two Series. For example:
print(new_master['S4'] == new_master['S1'])
0 True
1 True
2 False
dtype: bool
Now we iterate through the column combinations and count the True values with the help of NumPy's count_nonzero(). The rest is similar to yours:
for combo in col_combos:
    match_count = np.count_nonzero(new_master[combo[0]] == new_master[combo[1]])
    pct_same = match_count * 100 / len(new_master)
    if pct_same > 90:
        summary_dict[combo[0]].append({'match': combo[1], 'pct': pct_same})
print(summary_dict)
{'S4': [], 'S5': [{'match': 'S2', 'pct': 100.0}]}
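Note that, unlike the question's code, this version does not exclude the '--' missing-value marker before counting. If that exclusion matters, the question's mask can be folded back into the loop body (a sketch assuming the same '--' convention):

a, b = new_master[combo[0]], new_master[combo[1]]
match_count = np.count_nonzero((a == b) & (a != '--') & (b != '--'))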
I'd be curious to know how this performs. Good luck!
Answer 1 (score: 0)
Consider the following adjustment, which runs a list comprehension to build all combinations of the two dataframes' column names and then iterates through the pairs for matches above the 90% threshold.
# LIST COMPREHENSION (TUPLE PAIRS) LEAVES OUT CHARACTERISTIC (FIRST COL) AND SAME NAMED COLS
columnpairs = [(i,j) for i in new.columns[1:] for j in master.columns[1:] if i != j]
# DICTIONARY COMPREHENSION TO INITIALIZE DICT OBJ
summary_dict = {col:[] for col in new.columns[1:]}
for p in columnpairs:
    i, j = p
    is_same = (new['Characteristic'] == master['Characteristic']) & \
              (new[i] == master[j]) & (new[i] != '--') & (master[j] != '--')
    pct_same = sum(is_same) * 100 / len(master)
    if pct_same > 90:
        summary_dict[i].append({'match': j, 'pct': pct_same})
print(summary_dict)
# {'S4': [], 'S5': [{'match': 'S2', 'pct': 100.0}]}
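At the scale described in the question (roughly 500,000 columns by 10,000 rows), the per-pair Python loop in both answers may still dominate the runtime. Below is a minimal, illustrative sketch (the helper find_matches and its threshold parameter are introduced here for illustration, not taken from either answer) that compares one new column against a whole block of master columns in a single NumPy broadcast; for a very wide master frame the block would likely need to be processed in chunks to bound memory:

import numpy as np
import pandas as pd

def find_matches(df, new_cols, master_cols, threshold=90):
    # materialize the master columns once as a (rows, n_master) array
    master_block = df[list(master_cols)].to_numpy()
    master_valid = master_block != '--'
    row_cnt = len(df)
    results = {}
    for col in new_cols:
        new_vals = df[col].to_numpy()[:, None]  # shape (rows, 1)
        # broadcast: this column vs. every master column at once,
        # ignoring '--' on either side, as in the question's code
        same = (new_vals == master_block) & (new_vals != '--') & master_valid
        pct = same.sum(axis=0) * 100 / row_cnt
        results[col] = [{'match': m, 'pct': float(p)}
                        for m, p in zip(master_cols, pct)
                        if p > threshold and m != col]
    return results

print(find_matches(new_master, ['S4', 'S5'], ['S1', 'S2', 'S3']))
# {'S4': [], 'S5': [{'match': 'S2', 'pct': 100.0}]}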