Below is my sample data:
data['text'] data['terms_counter']
This is a sample test {u'This': 1, u'is': 1, u'a': 1, u'sample': 1, u'test': 1}
put returns between paragraphs {u'put': 1, u'returns': 1, u'between': 1, u'paragraphs': 1}
for linebreak add 2 spaces at end {u'for': 1, u'linebreak': 1, u'add': 1, u'2': 1, u'spaces': 1, u'at': 1, u'end': 1}
indent code by 4 spaces {u'indent': 1, u'code': 1, u'by': 1, u'4': 1, u'spaces': 1}
to make links {u'to': 1, u'make': 1, u'links': 1}
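I want to compute the cosine similarity of each row against every other row, collect the results in a dictionary, and store that dictionary as a column in the DataFrame. Below is the output I want: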
data['text'] data['cosine_value']
This is a sample test {'put returns between paragraphs': 0.41, 'for linebreak add 2 spaces at end': 0.41, 'indent code by 4 spaces': 0.35, 'to make links': 0.41}
put returns between paragraphs {'This is a sample test': 0.41, 'for linebreak add 2 spaces at end': 0.41, 'indent code by 4 spaces': 0.35, 'to make links': 0.41}
for linebreak add 2 spaces at end {'This is a sample test': 0.41, 'put returns between paragraphs': 0.41, 'indent code by 4 spaces': 0.35, 'to make links': 0.41}
indent code by 4 spaces {'This is a sample test': 0.41, 'put returns between paragraphs': 0.41, 'for linebreak add 2 spaces at end': 0.35, 'to make links': 0.41}
to make links {'This is a sample test': 0.41, 'put returns between paragraphs': 0.41, 'for linebreak add 2 spaces at end': 0.35, 'indent code by 4 spaces': 0.41}
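For reference, the similarity being asked for can be sketched as below. This is only a minimal sketch: the body of `get_cosine` is not shown in this post, so the version here is an assumption about what it does (cosine similarity between two term-count dictionaries), written with the standard library only.

```python
import math
from collections import Counter


def get_cosine(vec1, vec2):
    """Cosine similarity between two term-count mappings (assumed behaviour)."""
    # Only terms present in both mappings contribute to the dot product.
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[t] * vec2[t] for t in common)

    # Euclidean norm of each count vector.
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))

    # An empty (or all-zero) vector has no direction; treat similarity as 0.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return numerator / (norm1 * norm2)


a = Counter("indent code by 4 spaces".split())
b = Counter("for linebreak add 2 spaces at end".split())
print(round(get_cosine(a, b), 2))
```

With this definition, two rows that share no term get a similarity of exactly 0, so the values in the table above are best read as placeholders for the desired shape of the output.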
I tried the following code:
import pandas as pd

df = pd.DataFrame()
for i in range(len(data)):
    print(i)
    j_dict = {}
    for j in range(len(data)):
        text = data['text'][j]
        if i != j:
            # Compare row i against every other row j.
            x = get_cosine(data['terms_counter'][i], data['terms_counter'][j])
            # Key the score by the other row's text.
            if text not in j_dict:
                j_dict[text] = round(x, 2)
    # print(j_dict)
    # One output row per input row; pd.concat replaces the removed DataFrame.append.
    row = pd.DataFrame({'text': data['text'][i], 'similar_ones': str(j_dict)}, index=[0])
    df = pd.concat([df, row], ignore_index=True)
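The problem is that my real data has millions of rows, and this code takes 15 minutes to process just 10 rows. Can anyone help me solve this? I would like a more efficient way to run this algorithm.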
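One possible direction, offered as a rough sketch rather than a tested solution for data of this size: compute each row's norm once, build an inverted index from term to rows, and accumulate dot products only for pairs of rows that actually share a term. The function name `pairwise_cosine` and its parameters are hypothetical. Note two differences from the desired output above: pairs with zero similarity are omitted, and rows with identical text would collide as dictionary keys.

```python
import math
from collections import Counter, defaultdict


def pairwise_cosine(texts, counters, ndigits=2):
    """For each row, return {other_text: cosine} for rows sharing at least one term."""
    # Each row's norm is computed once instead of once per pair.
    norms = [math.sqrt(sum(c * c for c in cnt.values())) for cnt in counters]

    # Inverted index: term -> list of (row index, count), skipping zero counts.
    index = defaultdict(list)
    for i, cnt in enumerate(counters):
        for term, count in cnt.items():
            if count:
                index[term].append((i, count))

    # Accumulate dot products only for pairs that co-occur in a posting list.
    dots = [defaultdict(float) for _ in counters]
    for postings in index.values():
        for a in range(len(postings)):
            i, ci = postings[a]
            for b in range(a + 1, len(postings)):
                j, cj = postings[b]
                dots[i][j] += ci * cj
                dots[j][i] += ci * cj

    # Normalise the accumulated dot products into cosine values.
    return [
        {texts[j]: round(dot / (norms[i] * norms[j]), ndigits) for j, dot in row.items()}
        for i, row in enumerate(dots)
    ]


texts = [
    "This is a sample test",
    "put returns between paragraphs",
    "for linebreak add 2 spaces at end",
    "indent code by 4 spaces",
    "to make links",
]
counters = [Counter(t.split()) for t in texts]
similar = pairwise_cosine(texts, counters)
print(similar[3])
```

Assuming `data` is a pandas DataFrame as in the loop above, the result could be stored with something like `data['cosine_value'] = pairwise_cosine(list(data['text']), list(data['terms_counter']))`. Very common terms produce long posting lists, so this still degrades toward quadratic time unless such terms are filtered out first; for millions of rows, a sparse-matrix library may be a better fit.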
Here is how I think the problem could be approached, but I don't know how to proceed:
Can anyone give me any ideas for solving this efficiently?
Thanks