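I am parsing two different CSV files and need to match a column between them. Currently, when I run the snippet below, it returns no matching values even when the same address exists in both CSV files. The problem is that the address field in the OnlineData CSV file uses abbreviations. For example: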
In the Addresses csv    In the OnlineData csv
4587 Newton Road        4587 Newton Rd
7854 Food Court         7854 Food Ct
How can I tell Python to look only at the number ('4587') and the first word ('Newton') when looking for matching values in the two CSV files?
import csv

# Collect the address column (index 1) from the first file.
Addresses = set()
with open('Addresses.csv') as f:
    for row in csv.reader(f):
        Addresses.add(row[1])

# Collect the address column (index 1) from the second file.
OnlineData = set()
with open('C:/Users/OnlineData.csv') as g:
    for row in csv.reader(g):
        OnlineData.add(row[1])

# Exact-match intersection of the two sets.
results = Addresses & OnlineData
print('There are %d matching addresses between the two csv files' % len(results))
for result in sorted(results):
    print(result)
Answer 0 (score: 1)
Since you are only interested in matching part of the data, you can load just that part into a set and then take the intersection.
import csv

Addresses = set()
with open('Addresses.csv') as f:
    for row in csv.reader(f):
        # Drop the last word: loads "4587 Newton" instead of "4587 Newton Road"
        portion = ' '.join(row[1].split()[:-1])
        Addresses.add(portion)

OnlineData = set()
with open('C:/Users/OnlineData.csv') as g:
    for row in csv.reader(g):
        # Same truncation, so "4587 Newton Rd" also becomes "4587 Newton"
        portion = ' '.join(row[1].split()[:-1])
        OnlineData.add(portion)

# Intersection of the truncated addresses.
results = Addresses & OnlineData
print('There are %d matching addresses between the two csv files' % len(results))
for result in sorted(results):
    print(result)
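The obvious downside is that you lose some information (the last word of each address), although you could still retrieve it. Another option is to normalize the input, meaning you replace Rd with Road and Ct with Court, so that the values always match.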
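The normalization idea could be sketched roughly as follows. This is only an illustration written for Python 3, not a complete solution: the `ABBREVIATIONS` table, the `normalize` and `load_addresses` helpers, and the in-memory sample rows are all assumptions made for the example, and the table covers only the two abbreviations mentioned in the question.

```python
import csv
import io

# Hypothetical abbreviation table: only the two pairs from the question.
ABBREVIATIONS = {'Rd': 'Road', 'Ct': 'Court'}


def normalize(address):
    """Expand a known abbreviation in the last word of an address."""
    words = address.split()
    if words:
        # Strip a trailing period so "Rd." is treated like "Rd".
        last = words[-1].rstrip('.')
        words[-1] = ABBREVIATIONS.get(last, words[-1])
    return ' '.join(words)


def load_addresses(f, column=1):
    """Return the set of normalized addresses found in one CSV column."""
    return {normalize(row[column]) for row in csv.reader(f) if len(row) > column}


# In-memory stand-ins for Addresses.csv and OnlineData.csv (made-up rows).
addresses_csv = io.StringIO('1,4587 Newton Road\n2,7854 Food Court\n3,12 Elm Street\n')
online_csv = io.StringIO('a,4587 Newton Rd\nb,7854 Food Ct\nc,99 Oak Ave\n')

results = load_addresses(addresses_csv) & load_addresses(online_csv)
print('There are %d matching addresses between the two csv files' % len(results))
for result in sorted(results):
    print(result)
```

With real files, the `io.StringIO` objects would be replaced by the handles returned from `open(...)`. Because the full street type is kept, "4587 Newton Road" and "4587 Newton Court" remain distinct, which the truncation approach cannot guarantee; the cost is that every abbreviation you expect to see has to be listed in the table.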