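My problem has several parts. I need to import a CSV file with two fields (a numeric field used as an ID and a string field used as a description). Then I need to convert each string field into a collection of individual words (list? tuple? dict?) and search the other rows' collections to count the matches.
Example: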
id_field | desc_field
1 | some description
2 | some other description
3 | some third other description
What I need is a list of id_field matches:
id_field 1 has 2 matches in id_field 2
id_field 1 has 2 matches in id_field 3
id_field 2 has 3 matches in id_field 3
etc.
Importing the CSV file should be easy using:
import csv
reader = csv.reader(open('SOMEFILE.csv'), delimiter=',', quotechar='"')
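I'm sure I can use find or the in operator to find and count the words, but I'm having trouble writing the code that searches the string fields from the CSV.
Answer 0 (score: 0)
This should do it: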
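import csv

# Read every row as [id, list of words in the description].
# Assumes a comma-delimited file with no header row.
with open('SOMEFILE.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    data = [[line[0], line[1].split()] for line in reader]

# Compare each row only with the rows that follow it.
for i, (no1, words1) in enumerate(data):
    for no2, words2 in data[i + 1:]:
        # Count the distinct words the two descriptions share.
        matches = len(set(words1) & set(words2))
        print('id_field {} has {} matches in id_field {}'.format(no1, matches, no2))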
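Let me know if you have any problems or questions about the code. I assumed you only want to check forward, as in your example: row 1 is checked against rows 2 and 3, and row 2 is checked only against row 3 (if there are 3 rows).
If you want to exclude the cases with zero matches, add the following line before the print and indent the print under it:
if matches > 0: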
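To try the counting step on its own, here is a minimal sketch that runs on the sample rows from the question without reading a file. The helper names `count_shared_words` and `forward_matches`, the `skip_zero` flag, and the in-memory `rows` list are made up for illustration; it also assumes that a "match" means a distinct word that appears in both descriptions.

```python
from itertools import combinations


def count_shared_words(desc1, desc2):
    """Return how many distinct words appear in both descriptions."""
    return len(set(desc1.split()) & set(desc2.split()))


def forward_matches(rows, skip_zero=True):
    """Compare each row with the rows after it.

    Returns a list of (id1, id2, count) tuples. With skip_zero=True,
    pairs that share no words are left out.
    """
    results = []
    for (id1, desc1), (id2, desc2) in combinations(rows, 2):
        count = count_shared_words(desc1, desc2)
        if skip_zero and count == 0:
            continue
        results.append((id1, id2, count))
    return results


# Sample rows copied from the question (hypothetical in-memory stand-in for the CSV).
rows = [
    ('1', 'some description'),
    ('2', 'some other description'),
    ('3', 'some third other description'),
]

for id1, id2, count in forward_matches(rows):
    print('id_field {} has {} matches in id_field {}'.format(id1, count, id2))
```

With the three sample rows this gives 2, 2 and 3 matches, the same counts as the list in the question.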
Answer 1 (score: 0)
import csv
import itertools
import re

# Map each ID to its description, skipping the header row.
id_2_desc = {}
with open('SOMEFILE.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter='|')
    for n, (id_field, desc_field) in enumerate(reader):
        if n > 0:
            id_2_desc[id_field.strip()] = desc_field.strip()

# Compare every pair of IDs exactly once.
id_fields = id_2_desc.keys()
for id_field1, id_field2 in itertools.combinations(id_fields, 2):
    desc_field1 = id_2_desc[id_field1]
    desc_field2 = id_2_desc[id_field2]
    desc_tokens1 = re.split(r'\s+', desc_field1)
    desc_tokens2 = re.split(r'\s+', desc_field2)
    # Words common to both descriptions.
    matches = set(desc_tokens1) & set(desc_tokens2)
    print('id_field {} has {} matches in id_field {}'.format(id_field1, len(matches), id_field2))
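If you want to check the parsing part without a file on disk, something like the sketch below may help. It feeds the sample table from the question to `csv.reader` through `io.StringIO`, and it assumes Python 3, a `|` delimiter and a header row. The names `SAMPLE`, `load_descriptions` and `id_2_words` are hypothetical, and it stores a set of words per ID instead of the raw description.

```python
import csv
import io

# The sample table from the question, as pipe-delimited text with a header row.
SAMPLE = """id_field | desc_field
1 | some description
2 | some other description
3 | some third other description
"""


def load_descriptions(csvfile):
    """Return a dict mapping each ID to the set of words in its description."""
    reader = csv.reader(csvfile, delimiter='|')
    next(reader, None)  # skip the header row
    # str.split() with no argument also drops the padding around the delimiter.
    return {row[0].strip(): set(row[1].split())
            for row in reader if len(row) >= 2}


id_2_words = load_descriptions(io.StringIO(SAMPLE))
print(sorted(id_2_words['3']))  # ['description', 'other', 'some', 'third']
```

To read the real file, pass an open file object instead, for example `load_descriptions(open('SOMEFILE.csv', newline=''))`.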