I have 2 CSV files in the following format:
csv A
Tweet1,pos
Tweet2,neg
Tweet2,neg
csv B
Tweet2,neg
Tweet2,neg
Tweet2,pos
I want to count the rows that the two files have in common.
I tried this, but it seems to give the differences instead:
def compare(fileA, fileB):
    a_file = open(fileA, 'r')
    a_data = a_file.read()
    a_file.close()
    b_file = open(fileB, 'r')
    b_data = b_file.read()
    b_file.close()
    # compare the contents
    a_set = set(a_data.split(','))
    b_set = set(b_data.split(','))
    return list(a_set.intersection(b_set))

print compare('f.csv', 'full-corpus.csv')
The output should be 1.
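Answer 0 (score: 0)
You can try return len(a_set & b_set). & is the set intersection operator, which yields the elements present in both sets, and len then gives the number of those shared elements.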
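A minimal sketch of that idea, written so it also runs on Python 3. It assumes each row should be compared as a whole line, so it splits on line boundaries rather than on commas as the question's code does. The function name count_common_rows and the inline sample strings are hypothetical and only stand in for the two files.

```python
def count_common_rows(text_a, text_b):
    """Return the number of distinct rows that appear in both CSV texts."""
    # Split on line boundaries so each element is a whole row such as "Tweet2,neg".
    rows_a = set(line.strip() for line in text_a.splitlines() if line.strip())
    rows_b = set(line.strip() for line in text_b.splitlines() if line.strip())
    # & is set intersection; len counts the distinct shared rows.
    return len(rows_a & rows_b)


# Sample data copied from the question (assumed to be the full file contents).
csv_a = "Tweet1,pos\nTweet2,neg\nTweet2,neg\n"
csv_b = "Tweet2,neg\nTweet2,neg\nTweet2,pos\n"

print(count_common_rows(csv_a, csv_b))  # prints 1
```

Because sets drop duplicates, the repeated Tweet2,neg row is counted once, which matches the expected output of 1.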
Answer 1 (score: 0)
To do this, you only need to import the Counter class from collections and then read each file into a list.
import csv
from collections import Counter

a_list = []
with open('1.csv', 'Ur') as a_file:
    for line in csv.reader(a_file):
        a_list.append(line[0]+' '+line[1])
print a_list

b_list = []
with open('2.csv', 'Ur') as b_file:
    for line in csv.reader(b_file):
        b_list.append(line[0]+' '+line[1])
print b_list

counterA = Counter(a_list)
counterB = Counter(b_list)
counterSum = counterB & counterA
print counterA
print counterB
print counterB & counterA
print sum(counterSum.values())
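One thing worth noting about the Counter approach: & on two Counters keeps the minimum count of each key, so duplicated rows are counted as many times as they occur in both files. The sketch below, in Python 3 syntax, shows both totals on the question's sample data. The helper row_counter is hypothetical, and io.StringIO stands in for the real files.

```python
import csv
import io
from collections import Counter


def row_counter(text):
    """Count how often each CSV row occurs in the given text."""
    reader = csv.reader(io.StringIO(text))
    # Tuples are hashable, so each parsed row can be used as a Counter key.
    return Counter(tuple(row) for row in reader if row)


# Sample data copied from the question (assumed to be the full file contents).
csv_a = "Tweet1,pos\nTweet2,neg\nTweet2,neg\n"
csv_b = "Tweet2,neg\nTweet2,neg\nTweet2,pos\n"

common = row_counter(csv_a) & row_counter(csv_b)

print(sum(common.values()))  # prints 2: shared rows, duplicates included
print(len(common))           # prints 1: distinct shared rows
```

If the expected answer is 1, as stated in the question, len(common) is the value to use. sum(common.values()) fits when repeated matching rows should each count.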