I wrote the following code to compute Jaccard similarity:
import pandas as pd
import csv

# Read the first CSV column as 'Question' (passing names means no header row is expected)
df = pd.read_csv('data.csv', usecols=[0],
                 names=['Question'],
                 encoding='utf-8')

out = []
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        # Jaccard similarity: |A ∩ B| / |A ∪ B|
        out.append({'Question': q,
                    'Result': (float(len(c)) / (len(a) + len(b) - len(c)))})

new_df = pd.DataFrame(out, columns=['Question', 'Result'])
new_df.to_csv('output.csv', index=False, encoding='utf-8')
The output file looks like this:
Question           Result
The sky is blue    1.0
The ocean is blue  0.6
The sky is blue    0.6
The ocean is blue  1.0
This does return results. Now I would like to change the CSV output so that it shows them as a matrix, like this:
Question           The sky is blue  The ocean is blue
The sky is blue    1.0              0.6
The ocean is blue  0.6              1.0
I tried changing the code to use writerows, but I think I am still missing something. Thanks.
Answer 0 (score: 1)
Use defaultdict with the DataFrame constructor:
from collections import defaultdict

out1 = defaultdict(dict)
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out1[i][q] = (float(len(c)) / (len(a) + len(b) - len(c)))

print(out1)
df = pd.DataFrame(out1)
print(df)
                   The sky is blue  The ocean is blue
The ocean is blue              0.6                1.0
The sky is blue                1.0                0.6
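Because Jaccard similarity is symmetric, the nested dictionary above can also be filled by visiting each unordered pair once and splitting each question only once. The following is a minimal sketch, not part of the original answer: the helper name jaccard_matrix is hypothetical, and treating two empty questions as identical (score 1.0) is an assumption made here to avoid a division by zero.

```python
from itertools import combinations


def jaccard_matrix(questions):
    """Return {row: {col: score}}, computing each unordered pair only once."""
    # Split every question into its word set a single time.
    tokens = {q: set(q.split()) for q in questions}
    # Every question has similarity 1.0 with itself (the diagonal).
    matrix = {q: {q: 1.0} for q in tokens}
    for q1, q2 in combinations(tokens, 2):
        a, b = tokens[q1], tokens[q2]
        union = len(a | b)
        # Assumption: two empty word sets count as identical.
        score = len(a & b) / union if union else 1.0
        # Symmetry: J(q1, q2) == J(q2, q1), so fill both cells.
        matrix[q1][q2] = matrix[q2][q1] = score
    return matrix


questions = ['The sky is blue', 'The ocean is blue']
matrix = jaccard_matrix(questions)
print(matrix)
```

The result has the same dict-of-dicts shape as out1, so it can be passed to pd.DataFrame in the same way. Duplicate questions collapse into a single key, as they do with the defaultdict version.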
Original solution with DataFrame.pivot:
out = []
for i in df['Question']:
    str1 = i
    for q in df['Question']:
        str2 = q
        a = set(str1.split())
        b = set(str2.split())
        c = a.intersection(b)
        out.append({'Question1': q, 'Question2': i,
                    'Result': (float(len(c)) / (len(a) + len(b) - len(c)))})

# Keyword arguments are used because recent pandas versions no longer accept them positionally
df = pd.DataFrame(out).pivot(index='Question1', columns='Question2', values='Result')
print(df)
Question2          The ocean is blue  The sky is blue
Question1
The ocean is blue                1.0              0.6
The sky is blue                  0.6              1.0
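Since the question mentions trying writerows, here is a hedged sketch of how the requested layout could be written with the standard csv module alone, keeping the questions in their original order. The helper names jaccard and write_matrix are hypothetical, the in-memory buffer stands in for a real output file, and scoring two empty strings as 1.0 is an assumption.

```python
import csv
import io


def jaccard(s1, s2):
    """Jaccard similarity of the whitespace-separated word sets of two strings."""
    a, b = set(s1.split()), set(s2.split())
    union = a | b
    # Assumption: two empty strings are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def write_matrix(questions, fileobj):
    """Write a header row, then one row of scores per question."""
    writer = csv.writer(fileobj)
    writer.writerow(['Question'] + list(questions))
    writer.writerows(
        [row_q] + [jaccard(row_q, col_q) for col_q in questions]
        for row_q in questions
    )


questions = ['The sky is blue', 'The ocean is blue']
buffer = io.StringIO()  # stands in for open('output.csv', 'w', newline='')
write_matrix(questions, buffer)
print(buffer.getvalue())
```

With the pandas answers above, a comparable file should result from calling to_csv on the wide DataFrame with the index kept (that is, without index=False), although the row and column order then follows the DataFrame rather than the input order.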