Avoiding multiple loops when computing cosine similarity in a Pandas DataFrame

Time: 2017-11-02 22:59:40

Tags: python python-2.7 pandas

Here is my sample data:

data['text']                          data['terms_counter']

This is a sample test                 {u'This': 1, u'is': 1, u'a': 1, u'sample': 1, u'test': 1}
put returns between paragraphs        {u'put': 1, u'returns': 1, u'between': 1, u'paragraphs': 1}
for linebreak add 2 spaces at end     {u'for': 1, u'linebreak': 1, u'add': 1, u'2': 1, u'spaces': 1, u'at': 1, u'end': 1}
indent code by 4 spaces               {u'indent': 1, u'code': 1, u'by': 1, u'4': 1, u'spaces': 1}
to make links                         {u'to': 1, u'make': 1, u'links': 1}

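I want to compute the cosine similarity of each row against every other row, collect the results in a dictionary, and store that dictionary as a column in the DataFrame. This is the output I want: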

data['text']                                 data['cosine_value']

This is a sample test                        {'put returns between paragraphs': 0.41, 'for linebreak add 2 spaces at end': 0.41, 'indent code by 4 spaces': 0.35, 'to make links': 0.41}
put returns between paragraphs               {'This is a sample test': 0.41, 'for linebreak add 2 spaces at end': 0.41, 'indent code by 4 spaces': 0.35, 'to make links': 0.41}
for linebreak add 2 spaces at end            {'This is a sample test': 0.41, 'put returns between paragraphs': 0.41, 'indent code by 4 spaces': 0.35, 'to make links': 0.41}
indent code by 4 spaces                      {'This is a sample test': 0.41, 'put returns between paragraphs': 0.41, 'for linebreak add 2 spaces at end': 0.35, 'to make links': 0.41}
to make links                                {'This is a sample test': 0.41, 'put returns between paragraphs': 0.41, 'for linebreak add 2 spaces at end': 0.35, 'indent code by 4 spaces': 0.41}
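
For reference, cosine similarity between two term-count dictionaries is their dot product divided by the product of their Euclidean norms. The `get_cosine` function called in the code further down is not shown in the post, so the body below is only an assumption about what such a helper might look like, given that `terms_counter` holds dict-like term counts.

```python
import math
from collections import Counter


def get_cosine(counter_a, counter_b):
    """Cosine similarity between two term-count mappings (hypothetical helper)."""
    # Only terms present in both mappings contribute to the dot product.
    shared = set(counter_a) & set(counter_b)
    dot = sum(counter_a[t] * counter_b[t] for t in shared)
    # Euclidean norm of each count vector.
    norm_a = math.sqrt(sum(v * v for v in counter_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counter_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty row is similar to nothing
    return dot / (norm_a * norm_b)


a = Counter("indent code by 4 spaces".split())
b = Counter("for linebreak add 2 spaces at end".split())
c = Counter("to make links".split())
sim_ab = get_cosine(a, b)  # one shared term: "spaces"
sim_ac = get_cosine(a, c)  # no shared terms
```

Under this definition, two rows that share no terms score exactly 0, which matters for the efficiency question: such pairs never need to be compared at all.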

I tried the following code:

import pandas as pd

# get_cosine is assumed to be defined elsewhere (it is not shown in this post)
df = pd.DataFrame()
for i in range(0, len(data)):
    print(i)
    j_dict = {}
    for j in range(0, len(data)):
        text = data['text'][j]
        # Skip comparing a row with itself
        if i != j:
            x = get_cosine(data['terms_counter'][i], data['terms_counter'][j])
            # Key each score by the other row's text
            if text not in j_dict:
                j_dict[text] = round(x, 2)
#     print(j_dict)
    df = df.append(pd.DataFrame({'text': data['text'][i], 'similar_ones': str(j_dict)}, index=[0]), ignore_index=True)

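The problem is that my real data has millions of rows, and this code takes 15 minutes to process just 10 rows. Can anyone help me with this? I am looking for a more efficient way to run this algorithm.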
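
One possible direction, sketched here under the assumption that most pairs of rows share no terms, is an inverted index: map each term to the rows containing it, then accumulate dot products only for rows that actually overlap. The function name `similar_rows` and its arguments are hypothetical, and unlike the desired output above, pairs with zero similarity are simply left out of each dictionary.

```python
import math
from collections import Counter, defaultdict


def similar_rows(texts, counters):
    """Return, for each row, {other_text: cosine} for rows sharing a term."""
    # Compute each row's vector norm once instead of once per pair.
    norms = [math.sqrt(sum(v * v for v in c.values())) for c in counters]

    # Inverted index: term -> list of (row index, count of that term).
    index = defaultdict(list)
    for row, counter in enumerate(counters):
        for term, count in counter.items():
            index[term].append((row, count))

    results = []
    for i, counter in enumerate(counters):
        # Accumulate dot products only against rows that share a term.
        dots = defaultdict(float)
        for term, count in counter.items():
            for j, other_count in index[term]:
                if j != i:
                    dots[j] += count * other_count
        results.append({texts[j]: round(d / (norms[i] * norms[j]), 2)
                        for j, d in dots.items()})
    return results


texts = ["indent code by 4 spaces",
         "for linebreak add 2 spaces at end",
         "to make links"]
counters = [Counter(t.split()) for t in texts]
cosine_column = similar_rows(texts, counters)
```

With the DataFrame from the question, the call would presumably be `similar_rows(data['text'].tolist(), data['terms_counter'].tolist())`, with the returned list assigned to a new column. This is not a guaranteed fix: a term that appears in nearly every row still produces a long posting list, so very common words keep the worst case quadratic.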

These are the approaches I think might solve the problem, but I don't know how to proceed with them:

  1. kdtree implementation
  2. Nearest neighbors
  3. Can anyone give me any ideas for solving this efficiently?
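
On the kd-tree and nearest-neighbor ideas: those structures usually search by Euclidean distance, not cosine. The two are linked once every vector is scaled to unit length, because the squared distance then equals 2 - 2 * cosine, so the nearest neighbor by distance is also the most similar by cosine. The sketch below only checks that identity on two of the sample rows; the helper names are hypothetical. kd-trees generally lose their advantage as dimensionality grows, so whether this helps with a large vocabulary remains an open question.

```python
import math


def normalize(counter):
    """Scale a term-count mapping to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counter.values()))
    return {t: v / norm for t, v in counter.items()}


def squared_distance(u, v):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(u) | set(v)
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms)


def dot(u, v):
    """Dot product over the terms the two sparse vectors share."""
    return sum(w * v[t] for t, w in u.items() if t in v)


a = normalize({"indent": 1, "code": 1, "by": 1, "4": 1, "spaces": 1})
b = normalize({"for": 1, "linebreak": 1, "add": 1, "2": 1,
               "spaces": 1, "at": 1, "end": 1})

cosine = dot(a, b)                # both vectors already have length 1
dist_sq = squared_distance(a, b)  # matches 2 - 2 * cosine up to rounding
```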

    Thanks

0 answers:

No answers