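I'm looking for a faster way to compute pairwise accuracy, i.e. to compare elements of the same array (here, a pandas df column), compute the difference between them, and then compare the two results obtained. I have a DataFrame df with three columns: id (the document id), Judgement (the human assessment, an int), and PR_Score (the document's PageRank, a float). I want to check whether the two scores agree on which document of each pair is ranked better or worse.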
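For example:
id: id1, id2, id3
Judgement: 1, 0, 0
PR_Score: 0.18, 0.5, 0.12
In this case, the two scores agree that id1 ranks above id3, they disagree on id1 versus id2, and the human judgement is tied between id2 and id3, so my pairwise accuracy is:
agreement = 1
disagreement = 1
pairwise accuracy = agreement / (agreement + disagreement) = 1/2 = 0.5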
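The counting rule in this example can be checked with a few lines of plain Python. This is only a minimal sketch: the function name `pairwise_counts` is made up for illustration, and a tie in either score is skipped, as in the example above.

```python
from itertools import combinations

def pairwise_counts(judgements, pr_scores):
    """Count pairs on which the two scores agree or disagree about the order."""
    agree = disagree = 0
    for i, j in combinations(range(len(judgements)), 2):
        human = judgements[i] - judgements[j]
        pr = pr_scores[i] - pr_scores[j]
        if human == 0 or pr == 0:
            continue  # a tie in either score carries no ordering information
        if (human > 0) == (pr > 0):
            agree += 1      # both scores prefer the same document
        else:
            disagree += 1   # the scores prefer different documents
    return agree, disagree

agree, disagree = pairwise_counts([1, 0, 0], [0.18, 0.5, 0.12])
print(agree, disagree, agree / (agree + disagree))  # 1 1 0.5
```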
This is the code for my first solution, where I use the df columns as arrays (which helps reduce computation time):
import numpy as np

def pairwise(agree, disagree):
    return agree / (agree + disagree)

def pairwise_computing_array(df):
    humanScores = np.array(df['Judgement'])
    pagerankScores = np.array(df['PR_Score'])
    total = 0
    agree = 0
    disagree = 0
    for i in range(len(df)-1):
        for j in range(i+1, len(df)):
            total += 1
            human = humanScores[i] - humanScores[j]  # difference in human judgement
            if human != 0:
                pr = pagerankScores[i] - pagerankScores[j]  # difference in pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):
                        agree += 1  # they agree on which of the two is better
                    else:
                        disagree += 1  # they do not agree on which of the two is better
                else:
                    continue
            else:
                continue
    pairwise_accuracy = pairwise(agree, disagree)
    return (agree, disagree, total, pairwise_accuracy)
I tried a list comprehension hoping for faster computation, but it is actually slower than the first solution:
def pairwise_computing_list_comprehension(df):
    humanScores = np.array(df['Judgement'])
    pagerankScores = np.array(df['PR_Score'])
    # one boolean per pair that is not tied in either score
    sign = [np.sign(pagerankScores[i] - pagerankScores[j]) == np.sign(humanScores[i] - humanScores[j])
            for i in range(len(df)) for j in range(i+1, len(df))
            if (np.sign(pagerankScores[i] - pagerankScores[j]) != 0
                and np.sign(humanScores[i] - humanScores[j]) != 0)]
    agreement = sum(sign)
    disagreement = len(sign) - agreement
    pairwise_accuracy = pairwise(agreement, disagreement)
    return (agreement, disagreement, pairwise_accuracy)
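I can't run this on the whole dataset because it takes too long; I'd like something that computes in under a minute.
On a small subset of 1,000 rows, I got the following performance on my machine:
code1: 1.57 s ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
code2: 3.51 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)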
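Answer 0 (score: 1)
You have numpy arrays, so why not just use them? You can offload the work from Python to C-compiled code, which is usually (though not always) faster: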
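First, reshape the vectors into 1xN matrices (reshape returns the new array; resize works in place, returns None, and does not accept -1):
humanScores = np.array(df['Judgement']).reshape((1,-1))
pagerankScores = np.array(df['PR_Score']).reshape((1,-1))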
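Then take the differences; we are only interested in their signs:
humanDiff = np.sign(humanScores - humanScores.T)
pagerankDiff = np.sign(pagerankScores - pagerankScores.T)
np.sign yields only -1, 0, or 1 for both the integer judgements and the float PageRank scores (clip(-1,1) would do so only if the data were integers). You can then count: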
agree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff == pagerankDiff)).sum()
disagree = ((humanDiff != 0) & (pagerankDiff != 0) & (humanDiff != pagerankDiff)).sum()
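But the counts above are doubled, because entries (i,j) and (j,i) have exactly opposite signs in both humanDiff and pagerankDiff. You could consider summing only over the upper-triangular part of the square matrix: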
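# keep each pair once (strictly above the diagonal) and drop ties in either score
informative = np.triu((humanDiff != 0) & (pagerankDiff != 0), k=1)
agree = (informative & (humanDiff == pagerankDiff)).sum()
disagree = (informative & (humanDiff != pagerankDiff)).sum()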
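Put together, the steps above might look like the following sketch. The function name `pairwise_accuracy_matrix` is hypothetical, and the code uses `np.sign` so that float PageRank differences also reduce to -1, 0, or 1. Note that it builds full N x N matrices: for the 58,000-row dataset mentioned elsewhere in this thread that is roughly 3.4 billion entries per matrix, so this form is probably only practical for smaller inputs.

```python
import numpy as np

def pairwise_accuracy_matrix(human_scores, pagerank_scores):
    """Count agreeing and disagreeing pairs with N x N broadcasting."""
    human = np.asarray(human_scores).reshape(1, -1)
    pagerank = np.asarray(pagerank_scores).reshape(1, -1)
    # Entry (i, j) holds the sign of score[j] - score[i].
    human_sign = np.sign(human - human.T)
    pagerank_sign = np.sign(pagerank - pagerank.T)
    # Keep each unordered pair once (strict upper triangle) and drop ties.
    informative = np.triu((human_sign != 0) & (pagerank_sign != 0), k=1)
    agree = int((informative & (human_sign == pagerank_sign)).sum())
    disagree = int((informative & (human_sign != pagerank_sign)).sum())
    return agree, disagree

agree, disagree = pairwise_accuracy_matrix([1, 0, 0], [0.18, 0.5, 0.12])
print(agree, disagree, agree / (agree + disagree))  # 1 1 0.5
```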
Answer 1 (score: 1)
Here is the code that runs in a reasonable time, thanks to @juanpa.arrivillaga's suggestion:
import numpy as np
from numba import jit

@jit(nopython=True)
def pairwise_computing(humanScores, pagerankScores):
    total = 0
    agree = 0
    disagree = 0
    for i in range(len(humanScores)-1):
        for j in range(i+1, len(humanScores)):
            total += 1
            human = humanScores[i] - humanScores[j]  # difference in human judgement
            if human != 0:
                pr = pagerankScores[i] - pagerankScores[j]  # difference in pagerank score
                if pr != 0:
                    if np.sign(human) == np.sign(pr):
                        agree += 1  # they agree on which of the two is better
                    else:
                        disagree += 1  # they do not agree on which of the two is better
                else:
                    continue
            else:
                continue
    pairwise_accuracy = agree / (agree + disagree)
    return (agree, disagree, total, pairwise_accuracy)
This is the performance on my whole dataset (58,000 rows):
7.98 s ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answer 2 (score: 1)
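By exploiting broadcasting you can get rid of the inner for loop, since index j always starts one position ahead of index i (i.e. we never look back). However, there is a small problem with computing agreement/disagreement at the following line: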
if np.sign(human) == np.sign(pr):
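I don't know how to resolve it. Since you understand the problem better, I'm only providing skeleton code here for you to tweak further and get working. Here it is: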
def pairwise_computing_array(df):
    humanScores = df['Judgement'].values
    pagerankScores = df['PR_Score'].values
    total = 0
    agree = 0
    disagree = 0
    for i in range(len(df)-1):
        j = i+1
        human = humanScores[i] - humanScores[j:]  # difference human judg
        human_mask = human != 0
        if np.sum(human_mask) > 0:  # check for at least one positive case
            pr = pagerankScores[i] - pagerankScores[j:][human_mask]  # difference pagerank score
            pr_mask = pr != 0
            if np.sum(pr_mask) > 0:  # check for at least one positive case
                # TODO: issue arises here; how to resolve when (human.shape != pr.shape) ?
                # once this `if ... else` block is fixed, it's done
                if np.sign(human) == np.sign(pr):
                    agree += 1  # they agree in which of the two is better
                else:
                    disagree += 1  # they do not agree in which of the two is better
            else:
                continue
        else:
            continue
    pairwise_accuracy = pairwise(agree, disagree)
    return (agree, disagree, total, pairwise_accuracy)
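One possible way to resolve the TODO in the skeleton, offered as a sketch rather than as the answerer's intended fix: keep both difference arrays at full length and apply a single shared mask, so the shapes always match, then count agreements with a vectorised comparison instead of a scalar `if`. The function name `pairwise_computing_rows` is hypothetical, and it takes the two score arrays directly rather than the DataFrame. Each iteration only allocates arrays of length N, so memory stays modest even for large inputs.

```python
import numpy as np

def pairwise_computing_rows(human_scores, pagerank_scores):
    """Row-by-row broadcasting: compare element i with all later elements."""
    human_scores = np.asarray(human_scores)
    pagerank_scores = np.asarray(pagerank_scores)
    agree = disagree = 0
    for i in range(len(human_scores) - 1):
        human = np.sign(human_scores[i] - human_scores[i + 1:])
        pr = np.sign(pagerank_scores[i] - pagerank_scores[i + 1:])
        # Both arrays keep the same length, so one mask works for both.
        informative = (human != 0) & (pr != 0)
        same = human[informative] == pr[informative]
        agree += int(same.sum())
        disagree += int(same.size - same.sum())
    informative_pairs = agree + disagree
    # Assumption: return NaN when every pair is tied, instead of dividing by zero.
    accuracy = agree / informative_pairs if informative_pairs else float("nan")
    return agree, disagree, accuracy

print(pairwise_computing_rows([1, 0, 0], [0.18, 0.5, 0.12]))  # (1, 1, 0.5)
```

With a DataFrame, a call would look like `pairwise_computing_rows(df['Judgement'].values, df['PR_Score'].values)`.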