Saving Jaccard similarity to a CSV file

Time: 2019-01-24 13:32:45

Tags: python-3.x pandas

I wrote the following code to compute the Jaccard similarity between questions:

import pandas as pd
import csv

df = pd.read_csv('data.csv', usecols=[0],
                 names=['Question'],
                 encoding='utf-8')

out = []
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out.append({'Question': q,
                    'Result': float(len(c)) / (len(a) + len(b) - len(c))})


new_df = pd.DataFrame(out, columns=['Question','Result'])
new_df.to_csv('output.csv', index=False, encoding='utf-8')

The output file looks like this:

Question          Result
The sky is blue    1.0
The ocean is blue  0.6
The sky is blue    0.6
The ocean is blue  1.0

It does return results. Now I would like to change the CSV output so that it shows the results as a matrix, like this:

Question          The sky is blue The ocean is blue
The sky is blue    1.0             0.6
The ocean is blue  0.6             1.0

I tried changing the code to use writerows, but I think I am still missing something. Thanks.

1 answer:

Answer 0 (score: 1):

Use defaultdict together with the DataFrame constructor:

from collections import defaultdict

out1 = defaultdict(dict)
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out1[i][q] = float(len(c)) / (len(a) + len(b) - len(c))
print(out1)

df = pd.DataFrame(out1)
print(df)
                   The sky is blue  The ocean is blue
The ocean is blue              0.6                1.0
The sky is blue                1.0                0.6
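
The snippet above prints the matrix but does not show the CSV step the question asks about. A minimal sketch of that step follows, assuming the two sample questions from the post in place of data.csv and assuming every question contains at least one word. The helper name `jaccard_matrix` is made up for illustration and is not part of the original answer.

```python
from collections import defaultdict

import pandas as pd


def jaccard_matrix(questions):
    """Return a square DataFrame of pairwise Jaccard similarities.

    Hypothetical helper; assumes every question has at least one word,
    otherwise the division below fails.
    """
    out = defaultdict(dict)
    for q1 in questions:
        a = set(q1.split())
        for q2 in questions:
            b = set(q2.split())
            c = a & b
            out[q1][q2] = len(c) / (len(a) + len(b) - len(c))
    return pd.DataFrame(out)


# Sample data from the question; in practice this would be df['Question'].
questions = ['The sky is blue', 'The ocean is blue']
matrix = jaccard_matrix(questions)

# Keep the index so that each row is labelled with its question;
# index_label names the first header cell. With no path given,
# to_csv returns the CSV text instead of writing a file.
csv_text = matrix.to_csv(index_label='Question')
print(csv_text)
```

Passing a file name as the first argument, for example `matrix.to_csv('output.csv', index_label='Question', encoding='utf-8')`, writes the same text to disk. The key difference from the code in the question is that `index=False` is dropped, since the index now holds the row labels.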

Original solution with DataFrame.pivot:

out = []
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out.append({'Question1': q, 'Question2': i,
                    'Result': float(len(c)) / (len(a) + len(b) - len(c))})

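# Keyword arguments work across pandas versions; pandas 2.0 and later
# no longer accept these arguments positionally.
df = pd.DataFrame(out).pivot(index='Question1', columns='Question2', values='Result')
print(df)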
Question2          The ocean is blue  The sky is blue
Question1                                            
The ocean is blue                1.0              0.6
The sky is blue                  0.6              1.0
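
Note that the pivoted table above comes back sorted by label (ocean before sky) rather than in the order of the input. If the original order matters, one option is to call `reindex` after pivoting. The sketch below assumes the sample questions from the post and assumes the questions are unique, since `pivot` raises an error on duplicate index/column pairs. The helper `jaccard` and its treatment of two empty strings are choices made here, not part of the original answer.

```python
import pandas as pd


def jaccard(s1, s2):
    """Jaccard similarity of the whitespace-separated word sets of two strings."""
    a, b = set(s1.split()), set(s2.split())
    union = a | b
    # Assumption: two empty strings count as identical (similarity 1.0),
    # which also avoids dividing by zero.
    return len(a & b) / len(union) if union else 1.0


# Sample data from the question; assumed to contain no duplicates.
questions = ['The sky is blue', 'The ocean is blue']

# Long format: one row per ordered pair of questions.
rows = [{'Question1': q1, 'Question2': q2, 'Result': jaccard(q1, q2)}
        for q1 in questions for q2 in questions]

# Pivot to a matrix, then restore the original row and column order.
matrix = (pd.DataFrame(rows)
          .pivot(index='Question1', columns='Question2', values='Result')
          .reindex(index=questions, columns=questions))
print(matrix)
```

Because Jaccard similarity is symmetric, it does not matter which question ends up in the rows and which in the columns.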